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ABSTRACT 

This  annual  report  briefly  describes  progress  on  research  in  algorithms  for 
optimal  control  problems.  The  principal  research  focus  has  been  on  a  new  approach 
to  the  parallel  implementation  of  iterative  algorithms  for  optimal  control  based  on  a 
two  level  parametrization  of  optimality  conditions,  and  a  secondary  research  focus  has 
been  the  investigation  of  fault  detection  in  the  type  of  computational  networks  used 
for  optimal  control  computations.  Publications  describing  the  results  in  detail  are 
listed. 
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1.  RESEARCH  OBJECTIVES  AND  STATUS 

The  principal  focus  of  our  research  is  a  new  systematic  approach  to  design 
optimal  control  algorithms  that  may  be  implemented  on  parallel  machines.  This 
approach  is  based  on  a  two-ievel  parametrization  of  first-order  optimality  conditions. 
The  first  level  of  parametrization  is  concerned  with  the  decrease  of  the  overall  amount 
of  operation,  and  the  second  level  is  concerned  with  parallelism.  By  introducing 
parametrization  matrices  in  the  first  level  and  then  factoring  those  matrices  to  exhibit 
the  amount  of  parallelism  desired  in  the  second  level  as  a  function  of  tne  number  of 
processing  elements  to  be  used,  the  resulting  optimality  conditions  may  be  tailored  to 
the  computing  network  on  which  the  computations  are  to  be  performed.  The  results 
have  been  published  in  the  Journal  of  Parallel  and  Distributed  Computing  [1],  and 
have  been  presented  at  the  1987  Annual  International  Conference  on  Parallel  Process¬ 
ing  [5],  the  1987  Allerton  Conference  on  Communication,  Control  and  Computing  [6], 
the  Third  SIAM  Conference  on  Parallel  Processing  for  Scientific  Computing  [7],  and 
are  also  the  subject  of  L.  J.  Podrazik’s  Ph.  D.  dissertation  [8].  The  research  results 
concerning  the  convergence  properties  of  relaxation  algorithms  that  are  used  in  paral¬ 
lel  schemes  have  been  published  in  Mathematical  Programming  [2]. 

The  second  research  focus  has  been  the  investigation  of  fault  detection  in  compu¬ 
tational  networks  of  the  type  analyzed  in  the  course  of  our  investigation  of  parallelism 
for  optimal  control.  We  have  concentrated  our  effort  in  the  study  of  system  level  fault 
models,and  the  results  have  been  accepted  for  publication  in  the  IEEE  Transactions 
on  Computers  [4],  and  are  also  the  subject  of  M.  A.  Kennedy’s  Ph.  D.  dissertation  [3]. 
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2.  PH.D.  COMPLETED:  M.A.  KENNEDY,  MAY  1987 
A  STRUCTURAL  APPROACH  TO  A  SYSTEM  LEVEL  FAULT  MODEL 

ABSTRACT 

The  widespread  use  of  computers,  both  large  and  small,  has  lead  to  an  increase  in 
the  fault  problem.  This  problem  is  most  acute  while  the  system  is  operating,  because 
testing  and  fault  diagnosis  may  not  be  possible  during  operation.  One  method  of 
addressing  this  problem  is  to  use  the  processing  power  of  the  system  itself  to  enhance 
its  ability  to  diagnose  faults.  System  level  fault  models  provide  a  framework  for 
addressing  this  problem.  These  models  represent  a  system  in  terms  of  its  constituent 
processing  elements,  its  faults,  the  tests  to  identify  the  faults,  and  the  relationship 
between  the  faults  and  the  test  outcomes.  This  work  considers  the  system  level  fault 
model  of  Preparata,  Metze,  and  Chien  which  envisions  a  multiple  computer  system  as 
a  collection  of  processing  elements  and  test  links.  The  focus  of  the  work  is  the  rela¬ 
tionship  between  the  test  link  structure  and  the  system  diagnosis  properties.  Results 
include  a  test-link  based  method  for  partitioning  the  processing  elements  that  provides 
both  a  new  measure  for  comparing  systems  and  an  indication  of  the  complexity  of 
identifying  the  maximum  diagnosability  number  of  a  system.  This  partitioning  concept 
leads  to  new  diagnosability  conditions  that  fill  in  the  gap  between  existing  diagnosabil¬ 
ity  conditions  and  their  relationship  to  properties  of  the  test  link  structure.  The  parti¬ 
tion  is  also  used  to  synthesize  improved  algorithms  for  identifying  the  maximum  diag¬ 
nosability  number  of  a  system.  Turning  to  implied  faulty  set  properties  useful  in  diag¬ 
nosis,  results  for  both  constrained  and  unconstrained  system  structures  are  presented. 
Finally,  these  properties  are  incorporated  into  diagnosis  algorithms. 
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3.  F’n.D.  COMPLETED:  L  J.  PODRAZIK,  DECEMBER  1987 

PAR  AT  I  FT  IMPLEMENTATIONS  OF  GRADIENT  BASED  ITERATIVE  ALGOR1TH  MS 
FOR  OPTIMAL  CONTROL  PROBLEMS 

ABSTRACT 

The  primary  objective  of  this  research  is  to  develop  new  parallel  techniques  for 
solving  optimal  control  problems  that  occur  in  online  real-time  applications.  In  view 
of  the  availability  of  inexpensive  yet  powerful  hardware,  the  use  of  parallel  processing 
techniques  is  proposed  to  satisfy  both  the  speed  constraints  imposed  by  a  real-time 
setting  as  well  as  the  reliability  requirements  of  an  online  system.  Unlike  previous 
parallel  approaches  to  the  solution  of  optimal  control  problems,  the  goal  is  to  obtain 
an  efficient  solution  by  structuring  the  control  algorithms  to  exhibit  parallelism  which 
match  the  given  machine  architecture.  In  order  to  achieve  the  goal,  this  work  reexam¬ 
ines  optimal  control  problems  from  the  perspective  of  their  first-order  optimality  con¬ 
ditions  so  that  the  issues  of  parallelism  and  machine  architecture  may  be  considered  in 
the  forefront  of  the  algorithm  synthesis.  Results  include  the  development  of  an 
efficient  parallel  procedure  for  gradient  evaluation.  Embedded  in  the  parallel  gra¬ 
dient  evaluation  procedure  is  a  new  technique  for  solving  first-order  linear  recurrence 
systems  which  is  synthesized  as  a  function  of  the  number  of  available  computers.  The 
synthesis  approach  for  parallel  recurrence  solvers  is  also  new  and  uses  matrix  factori¬ 
zation  techniques  to  organize  the  computations  for  the  given  parallel  environment. 

The  results  for  parallel  gradient  evaluation  are  then  exploited  to  produce  efficient 
parallel  implementations  of  iterative  gradient  based  techniques  to  solve  the  linear  qua¬ 
dratic  regulator  optimal  control  problem  with  hard  control  bounds.  Finally,  a  practical 
multi-computer  architecture  is  presented  to  provide  an  integrated  parallel  environ¬ 
ment  for  the  solution  of  time  critical  optimal  control  problems. 


-7- 


4.  PUBLICATIONS:  JANUARY  1987  -  DECEMBER  1987 

[1]  G  G.  L  Meyer  and  L.  J.  Podrazik,  A  Parallel  First-Order  Linear 
Recurrence  Solver,  J.  Parallel  and  Distributed  Computing,  vol.  4,  1987,  pp. 
117-132. 

[2]  G.  G.  L.  Meyer,  Convergence  of  Relaxation  Algorithms  by  Averaging, 
Mathematical  Programming  vol.  40,  no.  1,  1988. 

[3]  M.  A  Kennedy,  A  Structural  Approach  to  a  System  Level  Fault  Model,  a 
dissertation  submitted  the  Johns  Hopkins  University  in  conformity  with  the 
requirements  for  the  degree  of  Doctor  of  Philosophy,  May  1987. 

[4]  G.  G.  L  Meyer  and  M.  A  Kennedy,  The  PMC  System  Level  Fault  Model: 
Cardinality  Properties  of  the  Implied  Faulty  Sets,  IEEE  Trans,  on  Comput¬ 
ers,  accepted  for  publication  August  1987. 

[5]  G.  G.  L  Meyer  and  L.  J.  Podrazik,  Parallel  Implementations  of  Gradient 
Based  Iterative  Agorithms  for  a  Class  of  Discrete  Optimal  Control  Prob¬ 
lems,  Proc.  1987  International  Conference  on  Parallel  Processing  St.  Charles, 
Illinois,  August  17-21,  1987. 

[t>]  G.  G.  L.  Meyer  and  L  J.  Podrazik,  Parallel  Gradient  Projection  Agorithms 
to  Solve  the  Discrete  LQR  Optimal  Control  Problem  with  Hard  Control 
Bounds,  Proc.  Twenty-Fifth  Annual  Ailerton  Conference  on  Communication, 
Control  and  Computing  Ailerton  House,  Monticeiio,  Illinois,  September  30- 
October  2,  1987. 

[7]  G.  G.  L  Meyer  and  L.  J.  Podrazik,  Parallel  Iterative  Agorithms  to  Solve  the 
Discrete  LQR  Optimal  Control  Problem  with  Hard  Control  Bounds,  Third 
SIAM  Conference  on  Parallel  Processing  for  Scientific  Computing  Los 


-8- 


Angeles,  California,  December  1-4,  1987. 

[8]  L  J.  Podrazik,  Parallel  Implementations  of  Gradient  Based  Iterative  Algo¬ 
rithms  for  Optimal  Control  Problems,  a  dissertation  submitted  the  Johns 
Hopkins  University  in  conformity  with  the  requirements  for  the  degree  of 
Doctor  of  Philosophy,  December  1987. 


-9- 


5.  PERSONNEL 

Principal  ‘iivestigators:  G.  G.  L  Meyer  and  H.  L.  Weinert 

Research  Assistants  (Ph.D.  Graduate  Students) 

A.  Balogh 
.A  Harrington 
M.  A  Kennedy 
L.  J.  PodraziL 
V.  K.  Marianov 


-  10- 


6.  APPENDIX  I 


JOL'R.VAL  OF  PARALLEL  AND  DISTRIBUTED  COMPUTING  4,  I  IT-1  [9K7| 
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In  this  paper  we  preseni  a  parallel  procedure  for  the  solution  of  first-order  linear 
recurrence  s> stems  of  size  \  when  the  number  of  processors  p  is  small  in  relation  to 
.V  We  show  that  when  I  <  p2  «  V.  a  first-order  linear  recurrence  system  of  size  Yean 
be  solved  in  5IY  -  I  )/(p  -  1 1  steps  on  a  p  processor  SIMD  machine  and  at  most 
>1 A  -  i)/|p  -  t)  steps  on  a  p  p.occssor  MIMD  machine.  Asa  special  case,  we  further 
show  that  our  approach  precisely  achieves  the  lower  bound  2(Y  -  ll/(p  -  1 1  for 
solving  the  parallel  prefix  problem  on  a /> processor  machine.  <  m-  Academic  Pro.  tnc 


I.  Introduction 

In  this  paper  we  present  a  parallel  algorithm  to  solve  the  well-known  first- 
order  linear  recurrence  system  R(\.  1  >  when  the  number  of  processors  p  is 
small  in  relation  to  V,  and  where  R(\.  1  >  is  defined  as  follows: 

R<\,  1  >:  Given  .V,  given  b  =  {b\.  bz . isl¬ 
and  given  a  =  (a:.  a3 . as). 

compute  v  =  {.x, .  .v: . xs) 

such  that  -T,  =  b\  and 

x,  =  a,x,- 1  +  b,  for  i  =  2.  3 . V. 

We  present  both  SIMD  and  MIMD  versions  of  the  algorithm.  We  analyze 
the  SIMD  version  by  first  considering  a  simplified  shared  memory  model  of 
parallel  computation  that  facilitates  comparison  with  previous  work.  In  that 
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model  the  parallelism  exhibited  by  the  proposed  algorithm  is  examined  in 
terms  of  data  dependencies  only,  therefore  allowing  us  to  determine  the  ide- 
afzed  performance  of  the  procedure.  We  then  consider  a  second  model  of 
computation  which  consists  of  a  S1MD  p  processor  nng  conhguration  with 
a  broadcasting  capability.  In  that  model  interprocessor  communication  is 
taken  into  account,  and  a  more  realistic  analysis  of  the  algorithm  is  per¬ 
formed.  The  MI.MD  version  of  the  algorithm  is  analyzed  by  considering  t^e 
same  simplified  model  as  in  the  SIM D  case,  with  the  exception  that  the  same 
operation  need  not  be  performed  by  all  processors  at  the  same  time.  Finally, 
we  bserve  that  the  algorithm  can  be  mapped  efficiently  to  a  MI.MD/?  proces¬ 
sor  ring  ,'onfiguration  with  a  broadcasting  capability . 

Many  algorithms  have  been  proposed  to  solve  linear  recurrences  in  paral¬ 
lel.  each  with  different  objectives.  Earlier  results  assumed  the  availability  of 
an  unlimited  number  of  processing  elements  and  were  concerned  with  deter¬ 
mining  the  number  of  processors  necessary'  to  achieve  minimal  computa¬ 
tional  time  [3,  6.  8.  10],  Later,  limited  processor  solutions  were  considered 
Chen  ei  a!.  [4]  presented  a  SIMD  algorithm  that  solved  A/th-order  sy  stems  in 
(.V/p)(2A/:  +  3.V/)  -r  <3(A/:log;(/?/A/)]  steps,  but  did  not  discuss  any  specific 
parallel  implementation.  Gajski  [5]  improved  upon  this  result  by  performing 
the  SIMD  computation  in  less  than  (\/n)(2\f:  +  3  V/)  steps  using  p  «  V  2 
processors  in  a  shared  memory  architecture.  In  this  paper  we  show  that  by 
using  a  SIMD  p  processor  ring  network  modified  to  support  global  broadcast¬ 
ing.  the  number  of  steps  required  to  solve  a  nrst-order  linear  recurrence  of 
size  .V  2s  pr  is  5(.V  -  1  )/(p  +  1 ).  This  improves  upon  the  results  of  [4.  5]  for 
the  first-order  case  when  .V  >  pr.  Our  approach  is  a  generalization  of  the 
matrix  factorization  technique  presented  in  [9],  and  it  reduces  to  the  SIMD 
procedure  presented  in  [5]  when  S  =  p2. 

Moreover,  when  a,  =  1,  for  all  /  in  [1.  2 . V],  R(\\  1>  reduces  to  a 

particular  form  of  the  parallel  prefix  problem.  For  .V  >  p.  Kruskal  et  al.  p] 
present  an  algorithm  which  solves  the  parallel  prefix  problem  in  2S'p 
+•  2  log;/?  -  2  steps.  Snir  [II]  improves  upon  this  approach  when  A  2=  p2  by 
solving  the  problem  in  2N/(p  +  I )  +  0(  1 )  steps,  which  is  within  a  constan' 
additive  term  of  the  lower  bound.  In  the  case  oi  parallel  prefix,  we  show  that 
our  algorithm  precisely  achieves  the  lower  bound  2(A'  -  1 )/(/?+  I )  estab¬ 
lished  by  Snir  [11]  when  S  ^  p2. 

Carlson  and  Sugla  [2]  considered  mapping  the  computation  of  first-order 
linear  recurrence  systems  to  perfect  shuffle  and  cube-connected  systems.  A 
common  feature  of  Carlson’s  algorithm  and  our  algorithm  is  that  the  compu¬ 
tation  is  organized  so  that  the  transfer  of  input  and  output  data  can  be  per¬ 
formed  concurrently  with  the  execution  of  the  algorithm,  thus  providing  a 
balance  between  I/O  and  processing  loads. 

An  important  problem  related  to  the  parallel  solution  of  R(S.  1  is  the 
parallel  evaluation  of  general  arithmetic  expressions  [  1  ].  In  the  case  of  first- 
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order  linear  recurrences  of  size  .V,  the  problem  is  to  efficiently  evaluate  vs  in 
parallel  using  p  processors.  In  that  case,  we  observe  that  when  applied  to 
computing  .v.%-  only,  a  straightforward  application  of  our  approach  does  not 
improve  upon  existing  results  [1.4]. 

In  order  to  specify  the  parallelism  exhibited  by  our  algonthms.  we  aug¬ 
ment  those  statements  which  can  be  executed  in  parallel.  We  use  the  syntax 

FORALL  /  6  5  DO  IN  PARALLEL 

BODY  /*  Comments*/ 

END  FORALL. 

which  indicates  that  the  BODY  may  be  executed  concurrently  for  each  i  in 
the  set  S. 

The  SIMD  procedure  to  solve  R<  .V,  1  >  is  presented  in  Section  11:  in  Section 
III  we  discuss  a  parallel  implementation  of  the  algonthm.  and  in  Section  IV 
we  present  a  M1MD  version  of  our  algorithm  to  solve  R(. V,  1  >.  Finally.  Sec¬ 
tion  V  comprises  our  conclusions. 


II.  The  SIMD  ALGORITHM 

The  abstract  model  of  SIMD  parallel  computation  (Fig.  I )  considered  in 
this  section  consists  of  a  global  parallel  memory,  p  parallel  processors,  and  an 
interconnection  network,  where  all  processors  perform  the  same  operation 
at  each  time  step.  We  further  simplify  the  model  by  making  the  following 
assumptions: 

Al.  Each  arithmetic  operation  (addition  or  multiplication)  is  per¬ 
formed  in  unit  time,  referred  to  as  a  step. 

A2.  There  are  no  accessing  conflicts  in  global  memory :  furthermore,  all 
data  are  assumed  to  reside  in  global  memory  initially. 

A3.  There  is  no  time  required  to  access  global  memory. 

This  simplistic  model  allows  the  parallelism  of  the  proposed  algonthm  to 
be  analyzed  without  introducing  the  added  complexity  of  the  implementa¬ 
tion.  In  Section  III  we  map  the  algorithm  to  a  specific  computational  net¬ 
work  and  we  analyze  the  corresponding  implementation. 

Given  A  and  p,  our  approach  to  solving  R(S\  1 )  consists  of  panitiomng 
R(S.  1)  into  a  sequence  of  (TV  -  \)/(p1  -  1)1  reduced  recurrence  systems 
R'  n.  1 ),  each  of  size  n  =  p:.  except  for  the  last  recurrence  system,  w  hich  may¬ 
be  of  size  less  than  pr.  Each  R(n,  1 )  is  then  solved  in  parallel  with  its  initial 
value  taken  as  the  final  value  obtained  from  the  previously  solved  reduced 
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Fig.  I  The  abstract  parallel  computational  model. 


recurrence  system,  except  for  the  first  recurrence  that  uses  .r,  =  b{ .  The  first 
R(n.  1 )  is  solved  as  follows:  (i)  Each  of  the  p  processors  concurrently  com¬ 
putes  a  partial  solution  for  a  different  .v,;  (ii)  after  p  parallel  iterations  pz 

partial  solutions  have  been  computed,  one  for  each  x,.  i  in  [1.  2 . /r], 

where  the  partial  solutions  for*,,  i  in  [  1. 2 . p).  are  precisely  the  solutions 

for  R(N,  1);  (iii)  based  upon  xp  the  next  p  partial  solutions  x,,  i  in  [p  +  1. 

p  +  2 . 2 p],  are  then  updated  in  parallel  to  their  correct  values.  After 

p  -  1  parallel  update  iterations  R(n,  1)  is  solved.  The  next  reduced  recur¬ 
rence  system  of  size  p2  is  then  solved  with  xj  as  its  initial  value.  We  continue 
in  this  manner  until  the  last  R(n,  1)  is  solved.  Since  the  initial  and  final 
values  of  each  R(n,  1)  overlap,  the  complete  solution  of  R(X,  1)  requires 
solving  f(/V-  l)/(/r  -  1  )1  reduced  recurrence  systems. 

We  now  describe  the  SIMD  algorithm  to  solve  R( N.  1 )  when  (,V  -  1  )/(/r 
-  1 )  is  an  integer.  For  «  in  [0,  1  )/(p1  -  1 )- 1  ],  let  the  index  sets 

and  T„  be  defined  as 


S„  =  { 1  +  mp  «(/r  —  1 ) :  m  =  1,2 


P~  1} 
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and 

7;  =  ;  1  +  mp  +  uj(  p2  -  1 ) :  m  =  0.  1 . p  -  1 ; . 

1.  PROCEDURE  R(\.  p.  a.  b) 

2.  x,  •=/>, 

3.  FORu.':=OTO(.V-  \)/(p'-  -  11-  I  DO  /*  Solve  each  R  n.  1  *; 

4.  FORALL / 6 5.  DO  IN  PARALLEL  *  Begin  Coefficient Computation  Phase" 

5.  T(i.  1]  :=  a, : 

6.  END  FORALL 

7.  FOR  r-  I  TO <p-  1 )  DO 

8.  FORALL; 6  5,  DO  IN  PARALLEL 

4.  4[(  +  ;.;l:=ii,. 

10.  END  FORALL 

[1.  END  FOR  /•  End  Coefficient  Compulation  Phase  */ 

12.  FORALL  /  6  5.  DO  IN  PARALLEL  /*  Begin  Partial  Solution  Phase  */ 

13.  x,'.=  b,: 

14  END  FORALL 

15  FOR  1 :  =  1  TO  ( p  -  1 1  DO 

16.  FORALL;  6  32  DO  IN  PARALLEL 

17.  .v,.; :=  *  h:.  : 

18.  END  FORALL 

19  END  FOR  /*  End  Partial  Solution  Phase  */ 

20.  FORiGS^IX)  /*  Begin  Solution  Update  Phase  */ 

21.  FORALL ; :  =  0  TO  ( p  -  1 1  DO  IN  PARALLEL 

22.  x,.;  :=  .4[i  f ),  1]  v,-,  *■  x,.,: 

23.  END  FORALL 

24.  END  FOR  /*  End  Solution  Update  Phase  */ 

25.  END  FOR 

26  END  PROCEDURE 

The  preceding  algorithm  sequentially  solves  (.V  -  l  )/(p2  -  1 )  reduced  re¬ 
currence  systems  of  size  p2,  each  in  parallel.  Each  reduced  system  is  solved 
in  three  phases:  the  coefficient  computation  phase  consists  of  the  execution 
of  loops  4  and  7  and  computes  all  coefficients  of  the  form  a, •  -  a, 
which  are  needed  later  during  the  solution  updates:  the  partial  solution  phase 
consists  of  the  execution  of  loops  12  and  15  and  computes  p 2  partial  solu¬ 
tions.  in  which  the  first  p  partial  solutions  are  the  actual  solutions:  and  finally, 
the  solution  update  phase  consists  of  the  execution  of  loop  20  in  which  the 
coefficients  computed  in  the  first  phase  are  used  to  update  the  next  p  esti¬ 
mates  at  each  iteration.  The  complete  solution  to  R(\.  1  >  is  therefore  ob¬ 
tained  after  executing  (,V  -  l)/(p:  -  1)  iterations  of  loop  3.  An  example 
illustrating  the  computations  performed  by  the  algorithm  is  given  in  Fig.  2 
for  the  case  N  =  17  and  p  =  3,  where  the  notation  .v,  is  used  to  indicate  a 
correct  value  for  the  solution.  Note  that  each  computational  level  may  be 
performed  in  parallel  using  at  most  three  processors. 

Our  model  assumptions  imply  that  we  need  only  consider  computational 
operations  when  determining  the  number  of  steps  required  by  the  algorithm. 
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Therefore  we  must  examine  those  computations  performed  in  loops  7.  15. 
and  20.  The  execution  of  loop  7  is  performed  p  -  1  times,  each  iteration 
requiring  p  -  1  processors  to  perform  a  single  multiplication  concurrently: 
thus  loop  7  requires p  -  1  steps.  Both  loops  1 5  and  20  are  iterated  p  -  1  times, 
each  iteration  requiring  p  processors  to  concurrently  perform  a  multiplica¬ 
tion  followed  by  an  addition.  Thus,  loops  15  and  20  each  require  2 (p  -  1) 
steps.  The  total  number  of  steps  required  to  solve  each  reduced  recurrence 
system  Rip1,  1 )  using p  processors  is  therefore  5 (p-  1 ),  and  hence  the  result¬ 
ing  theorem  follows. 

Theorem  1.  Given  X  and  p  such  that  (.V  -  1 )/( p2  —  1)  is  an  integer,  the 
number  of  steps  required  to  solve  the  linear  recurrence  system  R(X.  I )  using 
a  SIMD  parallel  computer  with  p  processors  is  5(.V  -  I  )/(p  +  1 ). 

If  (,V  -  l)/(/r  -  1)  is  not  an  integer,  our  approach  to  solving  R(X.  1)  re¬ 
quires  that  we  solve  the  reduced  recurrence  system  R(nr,  !),  where  nr  <  p:. 
One  approach  to  solving  R(nr ,  1)  is  to  use  a  technique  which  is  known  to 
be  efficient  whenever  n,  <  pc.  Applicable  techniques  include  the  algorithms 
presented  by  Chen  et  al.  [4]  and  Kogge  and  Stone  [6];  however,  these  tech¬ 
niques  are  not  desirable  because  they  require  the  machine  to  store  and  exe¬ 
cute  multiple  algorithms  based  upon  the  size  of  the  recurrence  system.  A  less 
efficient  but  more  practical  approach  to  solving  R(nr,  1)  consists  of  using 
the  proposed  technique  to  solve  the  augmented  system  R(p \  1 )  and  simply 
terminate  the  computation  when  the  last  x,  is  computed.  In  that  case  the 
number  of  steps  required  by  the  algorithm  is  at  most  l(N  -  1  )/(p2  -  1)15(/? 
-  1). 

Finally,  we  make  the  observation  that  the  above  SIMD  algorithm  most 
notably  differs  from  the  approaches  presented  in  [4,  5]  in  that  our  approach 
partitions  the  problem  and  solves  a  series  of  reduced  recurrences  of  size  p2 
sequentially.  However,  when  N  =  p1,  our  approach  reduces  to  that  of  [5], 
except  that  Gajski  presents  the  coefficient  computation  and  partial  solution 
phases  as  a  single  computational  phase.  Moreover,  when  N  s*  p2.  the  algo¬ 
rithm  of  Chen  et  al.  [4]  is  less  efficient  than  Gajski’s  as  a  result  of  implement¬ 
ing  an  extra  computational  phase  in  which  a  separate  first-order  recurrence 
of  size  p  is  solved  using  p  processors,  requiring  an  additional  2  log:/?  steps. 

When  N  and  p  are  powers  of  two,  the  algorithm  of  Chen  et  al.  requires 
SN!p  +  2  log  ip  -  5  steps  [4]  and  when  N/p2  is  an  integer.  Gajski's  SIMD 
algorithm  requires  (N/p2)(5p  -  3)  -  2  steps  [5],  whereas  when  (N-  1  )/(p:  - 
1)  is  an  integer,  our  SIMD  algorithm  requires  only  5 (N  -  !)/(/?+  1 )  steps. 
For  example,  when  N-  218  and  p  -  23.  the  numbers  of  steps  required  by  the 
SIMD  algorithms  presented  in  (4,  5]  and  this  paper  are  163.841.  151.550. 
and  145,635,  respectively. 

Finally,  when  a,  =  1 ,  for  all  /  in  [  1 . 2 . AT],  R(N,  1 )  is  a  particular  form 

of  the  parallel  prefix  problem  and  reduces  to  computing  the  cascade  sums 
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(b,  *  bf>.  (b\  +  bi  +  by) . (b\  +  b2  +  •  -  ■  +  b\)  in  parallel  using  p  proces¬ 

sors.  The  following  corollary  is  a  direct  consequence  of  Theorem  1 . 

Corollary  1 .  Given  .V  and  p  such  that  (.V  -  1 )/( p:  -  1)  is  an  integer  the 
number  of  steps  required  solve  the  parallel  prefix  problem  using  a  SIMD 
parallel  computer  with  p  processors  is  2(.V  -  1 )/( p  +  1). 

Thus,  when  (,V  -  1  )/(p:  -  l )  is  an  integer,  our  SIMD  algorithm  precisely 
achieves  the  parallel  prefix  computational  lower  bound  2(.Y  -  1  )/(p  +  1) 
established  by  Snir  [11],  This  result  improves  upon  existing  approaches  to 
solving  the  parallel  prefix  problem  when  S  >  p2.  In  that  case  the  parallel 
prefix  problem  is  solved  in  2,V/( p  +  1)  +  0{  I )  steps  by  Snir’s  algorithm  [11], 
2 {Sip)  +  log:p  -  2  steps  by  the  data-independent  algorithm  presented  by 
Kxuskal  et  al.  [7],  and  (\/p:)(2p  -  1 )  -  1  steps  by  Gajski's  algorithm  [5]. 

III.  The  SIMD  Parallel  Implementation 

The  abstract  model  of  SIMD  parallel  computation  presented  in  Section  IT 
neglected  the  issues  of  data  organization  and  alignment  as  well  as  communi¬ 
cation  overhead,  all  of  which  are  highly  machine  dependent.  We  now  present 
a  parallel  implementation  of  the  proposed  algorithm  that  takes  these  issues 
into  account.  The  SIMD  model  of  computation  considered  in  this  section 
(Fig.  3)  consists  of  p  processors  executing  the  same  operation  in  lock  step, 
with  each  processor  containing  its  own  local  storage.  The  processors  are  in¬ 
terconnected  by  a  unidirectional  ring  network  in  which  processor  i  transfers 


Out  put  Data  <i,  > 

Fig.  3.  The  practical  parallel  computational  model. 
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ciuta  to  processor  i  +  1.  i  in  [1,  2 . p  -  1],  and  processor  p  transfers 

data  to  processor  1 .  Furthermore,  we  assume  that  the  network  possesses  a 
broadcasting  capability  that  allows  any  processor  to  broadcast  data  to  all 
other  processors.  The  time  required  by  the  algorithm  will  be  determined  un¬ 
der  the  following  assumptions: 

Al.  Each  arithmetic  operation  (addition  or  multiplication)  is  per¬ 
formed  in  unit  time,  referred  to  as  a  step. 

A2.  Interprocessor  transfers  require  one  step. 

A3.  Data  broadcasts  require  one  step. 

A4.  Each  a,  and  b,  required  by  a  processor  is  assumed  to  reside  in  the 
local  memory  of  that  processor  initially. 

In  order  to  determine  an  efficient  processor  assignment,  we  first  make  the 
observation  that  the  p  consecutive  partial  solutions  updated  at  each  iteration 
of  the  update  phase  of  the  algorithm  must  reside  in  a  different  processor. 
Furthermore,  both  the  coefficient  computation  phase  and  the  partial  solu¬ 
tion  phase  of  the  algorithm  exhibit  explicit  data  dependencies  which  must 
be  preserved.  These  constraints  can  be  satisfied  if  we  rotate  the  processor 
assignment  at  each  parallel  iteration  of  the  algorithm,  and  in  that  case,  the 
algorithm  can  be  directly  mapped  to  a  SIMD  p  processor  unidirectional  ring 
network  with  broadcasting  capability.  Figure  4  illustrates  such  a  processor 
assignment  and  the  corresponding  communication  requirements  for  the  case 
N  =  17  andp  =  3. 

We  now  present  the  algorithm  to  solve  R( Ar.  1 )  as  executed  by  processor 
k,  for  all  k  in  [1.  2 . p). 

1.  PROCEDURE  R*(.V,  p,  a.  6) 

2.  x,  :=6, 

3.  FOR  w  :=  0  TO(.V  -  I )/( /r2  -  I )  -  I  DO  /*  Solve  each  R(n,  1  >  */ 

4.  .■![(,  1]  .=  a,\  lm  Begin  Coefficient  Computation  Phase  */ 

/*(=  1  +  (k-  llp  +  wip2-  I)*/ 

5.  FOR  /  :=  1  TO(p  -  1)  DO 

6.  .4(1 a, ,y. 4li  +)- I.;]; 

/*;  =  (!  +{k  -  1  -  1  )p)  mod  /r  +  w(/r  -  I)  •/ 

7.  END  FOR  /•  End  Coefficient  Computation  Phase  */ 

8.  IF  3c  =  1  THEN  x,  :  =  x,  ELSE  x, b,:  /*  Begin  Partial  Solution  Phase  */ 

/•/-  1  +(<c-  l) p  +  wl^-  1)*/ 

9.  FOR/:=  1  TO(p-  DDO 

10.  x,.,  :=  a,»,  i  + 

/*)  =  (!  +•  (k  -  I  -  Up)  mod  pr  +  wlp2  -  1 )  */ 

11.  END  FOR  /*  End  Partial  Solution  Phase  */ 

12.  FOR  m  :=  I  TO  (p  -  I )  DO  /*  Begin  Solution  Update  Phase  */ 

13.  x,., ,4[t  +  y.  i)x,.,  +.r,.,: 

/*;  =  I  mp  +  u;(  /r  -  \).j  =  (p  -  m  +  k  -  1)  mod  p  */ 

14.  END  FOR  /*  End  Solution  Update  Phase  */ 

15.  END  FOR 

16  END  PROCEDURE 
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Our  model  assumptions  imply  that  we  must  consider  interprocessor  com¬ 
munication  in  addition  to  operational  count  when  determining  the  number 
of  steps  required  by  the  algorithm.  Therefore,  we  must  examine  the  compu¬ 
tations  and  interprocessor  transfers  performed  in  loops  5.  9.  and  12.  Each 
iteration  of  loop  5  requires  an  interprocessor  transfer  of .-![/  •+-  j  —  |.  /]  fol¬ 
lowed  by  a  single  multiplication.  Thus,  loop  5  requires  Up  -  1 )  steps.  Loop 
9  is  iterated  p  -  1  times,  each  iteration  requiring  an  interprocessor  transfer 
of.v,.,_i  followed  by  a  multiplication  and  an  addition.  Thus,  loop  9  requires 
3 (p  -  1)  steps.  Loop  12  is  also  iterated  p  -  1  times,  each  iteration  requiring 
a  data  broadcast  of  .v,-j  followed  by  a  multiplication  and  an  addition.  Thus, 
loop  12  also  requires  3 (p  -  1)  steps.  The  total  number  of  steps  required  to 
solve  each  reduced  recurrence  Rip2.  1  >  using  p  processors  is  therefore 
8(  p  —  1 ),  and  hence,  the  resulting  theorem  follows. 

Theopfm  2.  Given  X  and.  p  such  that  (X  -  1  )/(p:  -  1 )  is  an  integer,  the 
number  of  steps  required  to  solve  a  linear  recurrence  system  R  X.  1 }  using  a 
SI MD  parallel  computer  with  p  processors  is  8(A'  -  l  )/(p  +  1 ). 

Among  the  existing  SIMD  algorithms  to  solve  R(X,  1).  the  SIMD  algo¬ 
rithm  presented  by  Gajski  [5]  can  be  most  efficiently  mapped  to  a  unidirec¬ 
tional  ring  network  with  broadcasting  capability.  Based  upon  the  assump¬ 
tions  made  in  this  section,  when  N /p2  is  an  integer  the  number  of  steps  re¬ 
quired  by  Gajski’s  approach  to  solve  R(X.  1)  is  (.\'//r)(8p  -  5)  -  3.  and 
therefore  when  X  >  p2  our  approach  is  more  efficient  than  Gajski's  when 
implemented  upon  a  ring  network  capable  of  broadcasting. 

Finally,  we  make  the  observation  that  the  algorithm  does  not  require  all 
of  the  inputs  a,  and  b,  in  order  for  the  processing  to  begin.  Specifically,  the 
algorithm  requires  p2  -  1  a,  and  p2  b,  for  every  5  (p  -  1 )  computational  steps, 
corresponding  to  solving  each  R(p2.  1)  in  sequence.  Similarly,  the  outputs 
V,  are  produced  in  blocks  of  p2  -  1  at  a  time.  This  suggests  that  I/O  could 
be  overlapped  with  the  computation,  providing  a  balance  between  I/O  and 
processing  loads,  and  therefore  the  deletion  of  assumption  A4  has  a  negligible 
effect  if  one  assumes  that  I/O  and  processing  can  be  done  concurrently. 


IV.  The  MIMD  Algorithm 

In  this  section  we  again  consider  the  simplistic  model  of  computation 
given  in  Section  II  with  the  exception  that  we  no  longer  require  all  processors 
to  execute  the  same  operation  at  each  step;  that  is.  we  now  consider  a  MIMD 
implementation  in  which  we  neglect  the  issues  of  data  organization  and 
alignment  as  well  as  communication  overhead. 

The  MIMD  approach  for  solving  R<  X,  1)  is  based  upon  the  observation 
that  only  p  -  1  processors  are  needed  at  each  iteration  of  the  coefficient  com- 
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putation  phase.  Assuming  (A'  -  1  )/(pr  -  1)  to  be  an  integer,  the  total  number 
of  multiplications  required  to  compute  all  necessary  coefficients  is  (N  -  1 M  p 

-  1  )/(p  +  1),  p  -  1  of  which  may  be  performed  concurrently  at  each 
step.  Therefore,  all  of  the  required  coefficients  can  be  computed  in 
(Ar  -  1  )/(p  +  1 )  steps  using  p  -  1  processors.  This  leaves  one  processor  free 
for  (N  —  1  )/(p  +  1 )  steps,  allowing  us  to  expand  the  size  of  the  recurrence  by 
at  most  n0  =  L(.Y  -  1  )/2(  p  +  1  )J  and  use  the  free  processor  to  solve  the  reduced 
system  R(n<>.  1 )  concurrently.  Thus,  using  a  MIMD  approach  we  can  solve 
the  entire  system  R(N  +  no,  1 )  in  5(iV  -  1  )/(p  +  1 )  steps. 

Given  a  recurrence  system  of  size  A'  and  the  number  of  processors  p  the 
following  lemma  expresses  n0  in  terms  of  A' and  p  only. 

Lemma  1.  Given  S and p,  n0  =  r[.Y  -  (p  +  2)]/(2/>  *  3)1. 

Given  A'and  p,  our  MIMD  approach  to  solving  R-  .V.  i  consists  of  parti¬ 
tioning  R(N,  1)  into  a  sequence  off(A'  —  no  —  1  )/(m  -  1)1+  1  reduced 
recurrences.  The  first  recurrence  is  of  size  n0  +  1  and  all  others  are  of  size  pr. 
except  for  the  last  recurrence,  which  may  be  of  size  less  than  pr.  The  coeffi¬ 
cient  computation  phase  of  the  algorithm  uses  p  -  1  processors  to  compute 
all  needed  coefficients  for  all  of  the  reduced  systems.  Concurrent  with  this 
computation,  the  free  processor  computes  the  solution  to  R(  no  +  1 ,  1 ).  Each 
subsequent  R(n.  1 )  is  then  solved  in  the  same  manner  as  in  the  SIMD  algo¬ 
rithm  by  executing  a  partial  solution  phase  followed  by  a  solution  update 
phase.  The  complete  solution  is  obtained  after  solving  all  T(Ar  -  nn  -  1 )/( p- 

-  1  )1  +  1  reduced  recurrences. 

We  now  present  the  MIMD  algorithm  to  solve  R(N,  1)  when  (AT  -  no 

-  1  )/{p2  -  1 )  is  an  integer.  As  in  the  SIMD  case,  it  is  not  difficult  to  modify 
the  algorithm  if  the  above  assumption  is  not  satisfied  by  simply  terminating 

the  computation  at  the  point  when  the  last  x,  is  updated.  For  w  in  [0,  1 . 

(A'  -  no  -  1  )/(/r  -  1 )- 1  ],  we  now  define  the  index  sets  Ua  and  Cw  as 

C„  -  { i  +  no  +  mp  +  u(  pr  -  l ) :  m  =  1,2 . p  -  1 } 

and 

C,  =  { 1  +  n0  +  mp  +  w(p:  -  1 ) :  m  =  0,  1 . p  -  1 } . 


1.  PROCEDURE  R(.V,  p.  a.  b) 

2.  x,  :=  bt 

3.  FOR  i :  =  2  to  +  1  DO  /‘Solve  R(no.  I;*/ 

4.  x,  =  a,x,-t+b, 

5.  END  FOR  /*  End  C/Jo.  1  <  Solution  */ 

6.  FOR  w  :=  0  TO  (iV  -  rc0  -  \)/([T  -  1)- I  DO/*  Begin  Coefficient  Computation  Phase*/ 

7.  FORALL  i  6  DO  IN  PARALLEL 

8.  4[(. :)  :=  a,: 

9.  END  FORALL 
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10.  FOR  i  :=  I  TO(/>-  1 1  DO 

11.  FORALL  (€(.'„  DO  IN  PARALLEL 

12.  4[;  -/./]  :=  a, .,.![/  -  !./): 

13  END  FORALL 

14.  END  FOR 

15.  END  FOR  *  End  Coefficient  Computation  Phase  * 

16.  FORa  OTOt.V  -  -  U/t/F  -  1 1  -  l  DO  *  Sohe each  R  n.  I  * 

1~,  FOR  ALL  1 1  E2  DO  IN  P  ARALLEL  /*  Begin  Partial  Solution  Phase  * 

IS.  \.-h. 

19.  END  FORALL 

20.  FOR  ;  :=  I  TO(/>  -  I )  DO 

21.  FOR  all I'.  DO  IN  PARALLEL 

22.  v ...- a.  v,...,  *  b.  : 

23  END  FORALL 

24.  FND  FOR  /*  End  Panial  Solution  Phase  *,' 

25.  FOR  t  6  C.  DO  /*  Begin  Solution  L'pdate  Phase  * 

26  FORALL /  :=  0  TO  I  p  -  1 1  DO  IN  PARALLEL 

27,  V ...  :  =  -t[l  *  ].  t)  A, -i  -  X  .  . 

28.  END  FOR  ALL 

29  END  FOR 

30.  END  FOR  ,  *  End  Solution  Update  Phase  * ' 

31.  END  PROCEDURE 

Note  that  (i)  the  coefficient  computation  phase  of  the  SIMD  algonthm  has 
been  modified  so  as  to  compute  the  necessary  coefficients  for  all  R  n.  I 
before  the  first  reduced  recurrence  is  solved  in  parallel,  and  (li)  the  processor 
that  is  idle  during  the  SIMD  coefficient  computation  phase  is  now  used  to 
concurrently  compute  the  solution  to  R(n0  +  1.  1  y.  An  example  illustrating 
the  computations  performed  by  the  MIMD  algonthm  is  given  in  Fig.  5  for 
the  case  X  =  19  and  p  =  3. 

Based  upon  the  MIMD  model  considered  in  this  section,  we  conclude  that 
the  time  required  by  the  MIMD  algorithm  is  determined  by  the  computa¬ 
tional  operations  performed  in  loops  3. 6.  and  1 6.  Loops  3  and  6  are  executed 
concurrently,  using  1  and  p  -  1  processors,  respectively.  Loop  6  requires 
IN  -  nn  -  1  )/(p  +  I )  steps,  and  the  quantity  n,,  has  been  defined  so  that  loop 
3  requires  at  most  the  same  number  of  steps  as  loop  6.  All  p  processors  are 
used  in  executing  loop  16.  and  thus  loop  16  requires  MS  -  -  1 )/( p  r  1 ) 

steps.  The  number  of  steps  required  by  the  MIMD  algonthm  is  therefore 
5(\  -  no  -  1  )/{p  +  1 )  steps.  Thus,  the  resulting  theorem  follows. 

Theorem  3.  Given  X  and  p  such  that  X  >  pr  +  ip  -  I.  the  number  of  steps 
required  to  solve  a  linear  recurrence  system  R(X.  1 }  using  a  Ml  SID  parallel 
computer  with  p  processors  is  at  most  f(,V  —  i)/(  p  +  3/2  )(p-  1)151/?-  1). 

When  X  =  pr  4-  Ip  -  1,  our  approach  reduces  to  the  MIMD  algonthm 
presented  in  [5],  in  which  the  number  of  steps  required  to  solve  R(X.  1  is 
at  most  f(.V  -  1  )!(pr  +  \p  -  2)15 (p  -  I ).  When  X  >  pr  +  \p  -  1 .  our  MIMD 
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approach  differs  from  [5]  by  organizing  a  single  coefficient  computation 
phase  to  compute  the  necessary  coefficients  for  all  R  n.  1  before  the  first 
reduced  recurrence  is  solved  in  parallel,  rather  than  including  a  coefficient 
computation  phase  as  part  of  solving  each  reduced  recurrence. 

Finally,  we  note  that,  like  the  SIMD  algorithm,  the  M1MD  algorithm  can 
also  be  mapped  directly  to  a  p  processor  unidirectional  ring  network  with 
broadcasting  capability.  Figure  6  illustrates  such  a  processor  assignment  and 
the  corresponding  communication  requirements  for  the  case  .V  =  19  and 
P  =  3. 


V.  Conclusions 

The  algorithm  presented  in  this  paper  exploits  the  fact  that  for  a  fixed 
number  of  processors  p.  the  parallel  approach  presented  in  [9]  to  solve 
R(X,  1 )  attains  maximum  speedup  tip  +  1 )  when  X  =  p:.  When  X  ?  p:.  the 
structure  of  R(X,  1 )  allows  the  solution  to  be  obtained  by  sequentially  solv¬ 
ing  a  series  of  reduced  recurrences,  each  of  size  pr.  except  for  the  last  recur¬ 
rence  system,  which  may  be  of  size  less  than  p1.  As  a  result,  we  are  able  to 
improve  upon  existing  approaches  for  solving  R(X.  1 )  whenever  X  >  p1. 
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ABSTRACT 

In  this  paper  we  present  the  parallel  implementations  of  two 
iterative  gradient  based  algorithms  to  solve  the  unconstrained  linear 
quadratic  regulator  optimal  control  problem.  We  show  that  parallel 
evaluation  of  the  step  length  and  gradient  of  the  quadratic  cost  func¬ 
tion  can  be  efficiently  performed  as  a  function  of  the  number  of  pro¬ 
cessors.  We  then  embed  our  parallel  step  length  and  gradient  pro¬ 
cedures  to  produce  parallel  implementationa  of  the  gradient  and  con¬ 
jugate  gradient  methods  that  may  be  executed  on  an  SIMD  machine. 

I.  INTRODUCTION 

Previous  parallel  approaches  to  the  solution  of  optimal  control 
problems  (4|,  [6],  (9|,  have  been  devised  without  explicitly  taking  into 
consideration  the  computational  environment.  In  particular,  when  the 
number  of  available  processors  is  small  in  relation  to  the  problem  size, 
the  above  techniques  simply  fold  the  computations  to  fit  the  number 
of  processors.  More  efficient  parallel  algorithms  may  be  devised  by 
considering  the  computational  environment  throughout  the  algorithm 
synthesis.  Toward  that  end,  we  present  in  this  paper  a  parallel  pro¬ 
cedure  for  gradient  evaluation  which  is  formulated  as  a  function  of  the 
number  of  available  processors.  Although  presented  in  the  context  of 
unconstrained  Optimal  control,  our  results  for  gradient  computation 
are  also  applicable  to  constrained  problems.  Furthermore,  we  show 
that  the  step  size  obtained  as  a  result  of  the  line  search  performed  at 
each  iteration  may  also  be  efficiently  computed  in  parallel.  We  then 
combine  the  techniques  for  parallel  gradient  evaluation  and  step  size 
determinmtion  to  produce  parallel  implementations  of  the  best-step 
steepest  descent  method  and  the  Fletcher- Reeve*  conjugate  gradient 
method  to  solve  the  linear  quadratic  regulator  control  problem  [5]. 

It  is  well  known  that  a  doted  loop  feedback  form  solution  exists 
for  the  linear  quadratic  regulator  control  problem  (LQR).  Our 
motivation  for  solving  that  problem  using  iterative  gradient  based 
techniques  is  that  our  basic  parallel  approach  can  be  applied  to  more 
complex  control  problems  in  which  the  system  dynamics  can  be  linear¬ 
ized  and  the  com  approximated  quadrabcally.  Furthermore,  efficient 
parallel  impiementattoa*  of  gradient  method*  suds  in  best-step 
steepest  descent  and  conjugate  gradient  mgpsta  that  similar  parallel 
implementationa  of  penalty  function  and  gradient  projection  methods 
may  be  used  to  solve  constrained  control  problem. 

Our  approach  to  the  parallel  evaluation  of  the  step  and  gradient 
reduce*  the  total  number  of  operation*  required  by  sharing  common 
terms  when  possible  and  then  introduce*  panilelim  The  degree  of 
par *lleLi*at  exhibited  by  the  step  end  gradient  computation  technique* 
presented  in  this  paper  varies  as  a  function  of  the  number  of  proces¬ 
sors  to  be  used.  We  oonatrain  the  number  of  tvailabie  processors,  p, 
to  lie  in  (he  range  l<,p  <I'iArv\  where  n  is  the  tea  of  the  system  state 
vector,  S  is  the  number  of  Mage*  in  the  control  process  usd  we 
assume  n  >m,  where  m  is  the  uza  of  the  coetroL  One  of  the  feature* 
of  the  proposed  parallel  iterative  algorithm  is  that  their  structure  is 
completely  specified  by  the  number  of  processors  whenever  the 
number  of  stages  H  >(p  /nf. 

An  efficient  technique  for  gradient  evaluation  using  a  single  pro¬ 
cessor  has  been  discussed  by  Potak  [14,  ppj66-69[.  A  direct  implemen¬ 
tation  of  this  technique  on  p  processors  achieves  linear  speedup  for  p 
up  to  it;  however,  for  p  >  it,  the  ipeedup  is  significantly  reduced.  Us 
thus  paper,  we  present  an  approach  to  gradient  computation  which 
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reduce*  to  a  direct  parallel  implementation  of  the  technique  given  in 
[8[  when  i<p<n  and  achieves  speedup  greater  than  i nip  -  *) 
when  n  <  p  <  mVv’ . 

A  critical  step  in  our  approach  involves  the  parallel  compulation 
of  the  state  and  costale  vectors.  When  n  -  1,  the  computations 
reduce  to  solving  forward  and  reverse  linear  recurrence  systems,  both 
of  size  V.  The  parallel  evaluation  of  m-th  order  linear  recurrence  sys¬ 
tems  has  been  extensively  studied  [1|-(3|,  [7],  To  solve  first-order 
linear  block  recurrence  systems  in  parallel  we  use  a  blocked  formula¬ 
tion  of  the  approach  presented  in  (7). 

The  organization  of  this  paper  is  as  follows  :  m  Section  II.  we 
state  the  discrete  linear  quadratic  optimal  control  prob¬ 

lem.  examine  the  gradient  of  the  com  function  and  give  the  steepest 
descent  algorithm  we  shall  consider.  Section  in  presents  the  step 
length  and  gradient  compulations  required  at  each  iteration.  In  Sec¬ 
tion  IV  we  give  a  parallel  procedure  to  solve  the  linear  recurrence  sys¬ 
tems  required  by  Section  HI.  Based  upon  the  results  of  Sections  III 
and  IV,  Section  V  presents  parallel  implementations  of  the  best-step 
steepest  descent  method  to  solve  the  LQR  problem  and  the 
corresponding  performance  analysis.  Finally,  in  Section  VI  conclu¬ 
sions  are  presented. 

IL  PRELIMINARIES 

We  consider  the  LQR  discrete  optimal  control  problem: 

Problem  I:  Given  an  m-input,  discrete,  time-varying  linear  system  in 
which  we  are  given  the  inibei  stele,  z«,  and 

A  -  A*.i  *  i  *  (1) 

where  for  i  •  O.L...^V,  z,  in  £*  is  the  state  of  the  system  at  time  i  and 
for  i  »  12...JV.  u,  in  Em  is  the  control  M  time  i,  find  the  mN  control 
vector u  •  (as i,  uj . uj»)‘  that  minimize*  the  performance  index 

/(«)  ■  -i  [z'Qz  *  u'Ruj, 

where  z  i*  the  mV  vector  r  -  (z^,  . zVA  Q  -  Diag(Q,,  C2,.., 

Qe)  is  the  mV wiAf  block  matrix  that  consist*  of  N  nxn  symmetric  posi¬ 
tive  senu-definite  blocks  Qt  and  R  •  Oiag(ffi,  Rj,.._  R„)  is  the 
mNxmN  block  matrix  that  ocinsisti  of  N  mxm  symmetric  positive 
definite  blocks  K,. 

The  hypotheses  oa  the  matrices  Q  sad  R  insure  that  Problem  1 
possesses  a  unique  solution  u"  sad  that  (dl  /du  \  •  0. 

We  now  introduce  a  formulation  for  dl  /At  that  a  used  in  our 
parallel  implementation  of  gradient  algorithms. 

A  direct  application  of  the  chain  rule  for  differentiation  yields 

d/  .  3L  *  2L 

du  du  dt  At' 

•here  the  z*’s  are  determined  in  accordance  with  Eq.  (1).  d!  /du  and 
dl /dt  are  the  LxrmV  and  LxrnV  Jacobian  matrices  u‘R  and  i‘Q  respec¬ 
tively,  and  dt/At  is  the  mVurmV  block  lower  triangular  Jacobian 
matrix  that  consists  of  V1  nxm  blocks  (dt/At)^  •  dz,/du,  obtained 
by  the  chain  rule  for  ail  i  and  j  in  (LA-Al  by 

0  i li  <  j 

ffy  i ti  •  j  (2) 

AAf-i  '  '  A/.iB,  if  f  >  /- 

Eq.  (2)  show*  that  the  influence  matrix  dz/At  satisfies 
F,  (dz/At)  •  F„  where  P,  is  the  hVvjijV  block  lower  bidiagonal 


d* <_ 

du, 


t 
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aunt*  that  consist*  of  S2  nxn  blocks  [fj  sad  is  the  nNxmN 
block  diagonal  matrix  that  consists  of  A/*  nwn  blocks  \FJ  defined 
for  all  i  and  j  in  by 


K 


i 

A, 

0 


lfl  -  j 
til  »  /  *  1. 
otherwise 


M. 


ifi  ■  ; 

otherwise. 


Let  £0  be  the  nNvi  block  matrix  that  consists  of  N  nxr i  block! 


jVa]  defined  for  ail  i  in  (L2......V]  by 


A  i  if  i  «  1 

0  otherwise. 


let  g  •  dJ  du‘  be  the  the  m.Vxl  gradient  of /(u)  with  respect  to  u  and 

let  A  be  tne  /tiV  costate  vector  A  «  (A*,,  A$ . A1*)*.  A,  in  £*,  defined 

by  A  »  F'Qz.  Then,  given  u  and  to,  the  gradient  g  may  be  obtained  by 
using  the  three  equations 

F, t  -  F,u  *  FdI,,  (3) 

-  0*.  (*) 

g  -  Ru  *  F\X.  (5) 


With  the  notation  g*  »  (dl /du) l.,»,  the  version  of  the  best- 
step  steepest  descent  method  that  we  use  ts  the  following; 

Algorithm  I:  Let  u 1  be  given. 

StepO:  Set*  •  l  and  compute g1. 

Step  1:  If  ||g*  ||2  <  a  stop;  else  go  to  Step  2. 

Step  2:  Compute  a*  to  J  (ir*  -  aV). 

Step  3:  Compute  g**1 
Step  4;  Let  u**1  ■  u*  -a*g*. 

Step  5:  Set  *  «  k  ♦  1  and  go  to  Step  1. 

In  Section  HI  we  present  our  approach  to  the  computation  of 
gl,  a*,  and  g**1  and  we  show  that  the  computation  of  a*  and  g**1 
shares  common  terms. 


HI.  GRADIENT  AND  BEST  STEP  COMPUTATION 

We  first  consider  the  computation  of  g1.  which  is  performed 
only  once  in  Step  0  of  Algorithm  1.  As  a  consequence  of  Eqs.  (3).  (4) 
and  (S3  we  obtain  the  gradient  evaluation  technique  proposed  by 
Polak  (81: 

Algorithm  i  Given  u 1  and  r«. 

Step  I:  Computer1  such  that  Fjl  »  Fjt1  *  £*rt. 

Step  2:  Compute  A1  such  that  /VA1  *  Qtl- 
Step  3:  Compute  g1  *  Ku 1  *  /V*1. 

Due  to  the  lower  block  butiagrmal  structure  of  F„  Step*  I 
and  2  of  Algorithm  2  require  the  solution  of  N  wage  forward  and 
reverse  nx*  block  first-order  linear  recurrence*,  respectively.  Parallel 
procedures  for  linear  recurrences  are  presented  is  oea  section.  Given 
A1.  Step  3  computet  g1  by  computing  each  of  the  N  uncoupled  com¬ 
ponents  g}  »  fd//dii()i.,i;  thus,  Step  3  exhibits  linear  speedup  when 
executed  in  parallel 

We  now  conmder  the  computation  of  the  optimal  step  length 
a*.  The  cost  function  is  quadratic  and  therefore  a  closed  form  solu¬ 
tion  for  a*  exists.  It  is  dear  that 

/(*)- { -'<*  *  rsiQr.'F.yu 

*  {  V\FlQF}r*, 


a  -  Y<g*.d*>,6  •  -<g*/>,c  -  J(u‘ ). 

dk  *  V  *  rsiQr.'Fj, 
and  it  follows  that  the  optimal  step  length  a*  is 

a* .  J-  . 

2a  <g*,d*> 

Once  d‘  is  known,  the  gradient  g*  * 1  is  easy  to  evaluate  using 
g**‘  -g*-«*d*. 


Our  approach  to  computing  the  quantity  d*  consists  of  using 
Algorithm  3  below,  where  we  note  that  instead  of  requiring  £,‘  and 
£?,  we  solve  linear  systems  corresponding  to  F,  and  F\. 

Algorithm  3:  Given  g*. 

Step  1:  Computes/*  *  Kg*. 

Step  2:  Compute  j*  -  £,g*. 

Step  3:  Compute  w*  such  that  £,w*  -  a* 

Step  4:  Compute  if*  -  Qwk 

Step  S:  Compute  v*  such  that  F',v‘  -  y* 

Step  6:  Compute  x*  »  F\v*. 

Step  7:  Compute  d*  ■  (/*  ♦  x*. 

With  the  exception  of  Step*  3  and  5,  Algorithm  A  computes 
d*  by  executing  a  series  of  matrix-vector  products  followed  by  a  lector 
sum.  Each  of  the  matrix-vector  products  consists  of  .V  uncoupled 
block  matrix-vector  products,  thus  exhibiting  linear  speedup  when 
implemented  in  paralleL  However,  due  to  the  structure  of  £„  Steps  3 
and  5  require  the  solution  of  S  stage  forward  and  reverse  nxn  block 
first  order  linear  recurrences,  respectively.  As  in  the  case  of  Algo¬ 
rithm  2.  this  again  suggests  the  need  for  parallel  procedures  to  solve 
linear  recurrences.  Note  that  the  state  r**1  and  costate  A*'1 
corresponding  to  u**1  can  be  obtained  easily  from  the  quantities  w* 
and  v*  computed  in  Step*  3  and  3  of  Algorithm  3,  that  is. 

x**'»r*.(/w*  and  A**1  *  A* -<x*v* 


IV.  PARALLEL  procedures  for  linear  recurrences 


The  mod-1  of  ST  cry  parallel  computation  that  we  use  consists 
of  a  global  parallel  memory,  p  parallel  processors,  and  a  control  unit, 
where  all  processors  perform  the  same  operation  at  each  time  step. 
We  further  simplify  the  model  by  making  the  following  assumptions: 
(i)  each  computational  operation  takes  the  same  amount  of  time, 
referred  to  as  a  step,  (ii)  there  are  no  acrrssing  cooficts  in  global 
memory,  (m)  all  initial  data  reside*  in  global  memory,  and  (rv)  there  is 
no  time  required  to  access  global  memory. 

We  now  present  t  blocked  version  of  the  parallel  scalar 
approach  given  in  (7]  to  solve  forward  N  stage  nxn  block  first-order 
linear  recurrence  systems  that  we  use  to  implement  Algorithms  2  and 
3. 


The  forward  recurrence  problem  is:  given  nxn  matrices  i 
■  2,  3,—  S  and  given  vectors  a  £*,  i  *  L2.....V,  find  the  n  vectors  t, 
such  that/]  -  7i  andx,  »  Aa,  t  ♦  *n,t  ■  2,  3,..„  ,V.  Let 


0  • 


,V-1 

fp/nY-l) 

N  ■  1 


tip  >  n 


otherwise. 


[Pi* 


tip  >  n 
otherwise. 


For  w  in  (0.L— /VI),  define  the  index  set* 


/.(w)  ■  {«  ♦  u**2-!)  :i  ■  *-L*-2,...,l) 


and 


fi(w)  *  (m  *  Matfw* Luf*2-!))  :i  • 


therefore 

/(«*  -og*)  -  tcY  ♦  ha  ♦  c. 


Thu*,  given  y  -  . -rV)*,  ■*  »£*  *»d  precomputed  A  [/  */,/ ) 

*  Ai.fAt  „.|  Aj,j  */,(«),  i  in  (O.L-^-1),  the  foUorwmg  procedure 
solve*  the  forward  block  recurrence  system,  where  for  presentation 
simplicity,  we  assume  that  0  and  a  are  integers. 


where 
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1.  PROCEDURE  FORWARD/iV,/i^>,y) 

2.  Z(  7t 

3.  FOR  u  :  •  0  TO  fl  •  1  DO 

4.  FORALL  j  *  /,(«)  DO  IN  PARALLEL  z, :  *  7,; 

5.  FOR  i .  -  1  TO  .Vmr(l,*-1)  DO 

6.  FORALL  j  i  ( i(w)  DO  IN  PARALLEL 

2-  4|  *j  :  *  -4*  *j4i  *1*1  *'*♦!» 

8.  END  FORALL 

9.  END  FOR 

10.  FORALL/ «/a(u)  DO 

U.  FOR  i  OTO*-1  DO  IN  PARALLEL 

12.  J,.,  :-A[i  ♦/.;lr,.,  *z..,; 

13.  END  FOR 

14.  END  FORALL 
U.  END  FOR 

16.  END  PROCEDURE 

When  1  <p  <",  tiie  index  set  f0(u)  is  empty  tod  procedure 
FORWARD  reduces  to  sequentially  executing  step  7  .V-l  times,  each 
execution  using  p  processors.  When  n  <  p  <nN'A,  procedure  FOR¬ 
WARD  sequentially  solves  0  reduced  block  recurrence  systems  in  z,  of 
size  (pin? ,  each  in  parallel.  Each  reduced  system  is  solved  in  two 
phases:  the  first  phase  consists  of  the  execution  of  steps  4  and  S  and 
computes  (p/n ?  partial  solutions,  in  which  the  fust  p  In  are  the 
actual  solutions  and  the  second  phase  consists  of  the  execution  of  loop 
10  in  which  the  precomputed  "  xn  block  matrices  A  (i  * ;,/  ]  are  used  to 
update  the  next  p  /n  partial  solutions  at  each  iteration.  We  assign  n 
processors  to  perform  each  of  the  p  /n  concurrent  executions  of  steps 
7  and  12.  The  complete  solution  to  the  block  recurrence  system  is 
obtained  after  executing  fl  iterations  of  loop  3.  If  fl  is  not  an  integer, 
then  we  replace  fl  by  f  O']  and  simply  terminate  the  computation  when 
is  computed  and  if  a  is  not  an  integer,  we  replace  «  by  [p  /^. 

A  similar  procedure  REVERSE  used  to  solve  reverse 
recurrences  in  parallel  may  be  obtained  by  a  straightforward 
modification  of  procedure  FORWARD  and  hence  will  not  be  given. 

We  now  give  the  number  of  steps  required  to  solve  either  an 
N  stage  forward  or  reverse  first-order  nun  block  linear  recurrence  sys¬ 
tem. 

Theorem  I :  Given  N,  n  andp  such  that  p  « 1  or  p  /"  is  an  integer,  the 
number  of  parallel  stepa  required  to  solve  a  block  "» i  fust-order 
linear  recurrence  system  of  length  N  using  p  processors  is 
2 -2 

(N-l)—  if  1  <p  <n 

T  P 

•  "  \TT\Mp  - n )  iln  <p<nNv'. 


It  is  clear  from  Theorem  1  that  the  speedup  S,  •  T X/T,  exhi¬ 
bited  by  the  procedures  FORWARD  and  REVERSE  is  p  when 
1  <p  <n  and  WUp  *  ")  when  «  <p  SaiV**  and  p  /"  and  fl  are 
integers.  The  corresponding  efficiency  E,  »  5,/p  a  therefore  1  when 
l  <p  <n  and  in  *  n  f2p  when  n  <p  inNvt  and  p/n  and  (1  are 
integers.  In  Figs,  l,  2.  3  and  4  we  plot  E,  ter  tha  values  of  n  -  8,  16, 
32  and  64  respectively,  where  the  effideney  corresponding  to  the  pro¬ 
cedures  FORWARD  and  REVERSE  is  denoted  bv  the  solid  line  in 
each  plot  Thus,  we  see  that  the  efficiency  increases  with  increasing 
values  of  n  and  p  and  j>  independent  of  N. 

V.  PARALLEL  H3T  STEF  STEEPEST  DESCENT 

We  now  use  the  penile!  procedures  for  the  solution  of  linear 
recurrences  discussed  in  the  previous  section  to  obtain  parallel  imple¬ 
mentations  of  Algorithms  2  and  3  give  the  corresponding  number  of 
stepa  required  for  their  execution  when  p  processors  are  used. 

We  fust  give  the  parallel  implementation  of  Algorithm  2. 

1.  PROCEDURE  GRADIENT(rt,u l) 

2.  FORALL  f  a  {  L2.--A}  DO  IN  PARALLEL  -,}  :  -  B,u> ; 

3.  y|  -ti  *  A,t,; 

4.  rl  -  FORWAR D(Nap,?); 

5.  FORALL  i  r  { L2,-JV}  DO  IN  PARALLEL  rf  :  •  Qj}; 


6.  A1  -  REVERSEfiV./i^.f1); 

7.  FORALL  i  r  { L2,..,.V}  DO  IN  PARALLEL #<  -  R,u,‘  -  B'X 

8.  RETURN#1; 

9  END  PROCEDURE 


Lemma  I:  Given  to,  u V,  ",  m  and  p  such  that  p  •  1  or  p  /"  is  an 
integer,  the  number  of  stepa  required  py  use  procedure  gkaliicNT 
to  compute;1  using p  processors,  1  <p  <nNv\  is 


►  2m -2)  ■ 


(h»  •lm-1)  -  - 


■f  1  <p  <« 


(l»*i».2).  *2(11-1)  ♦  rtTinpw) -Js  ipiwvv* 


We  neat  give  the  parallel  implementation  of  Algorithm  3. 

1.  PROCEDURE  DIRECTION/#*) 

2.  FORALLi  r{L2,..,N}  DO  IN  PARALLEL**  -  R#*; 

3.  FORALL  i  «  { L2...,iV}  DO  IN  PARALLEL  f*  :  - 

4.  w*  •  FOR  WARD/, V^i^i.f*); 

5.  FORALL  i«{L2,..,N)  DO  IN  PARALLEL  u*  O.wf  . 

6.  v*  -  REVERSEfN^.q*); 

7.  FORALL i  «  {L2. .. „N}  DO  IN  PARALLEL  yf  -Sjvf; 

8.  FORALLi*{L2,  vV)  DO  IN  PARALLEL  d*  :- m,*  *  xf; 

9.  RETURN  d*; 

10.  END  PROCEDURE 

Lemma  Z  Given  #*,  N,  n,  m  and  p  such  that  p  ■  1  or  p  n  n  in 
integer,  the  number  of  stepa  required  by  procedure  DIRECTION  to 
compute  d*  using p  processors,  1  <p  </iNu’,  is 

££-(«■* 2*. -J)  *  |^jo*2»-l).-y-  il<p<« 

,  *»».23  *  [-£~jo*»».i)  ♦  rniao^s)  ,/»  <p<«rrv. 


We  now  embed  the  parallel  procedures  GRADIENT  and 
DIRECTION  to  obtain  a  parallel  implementation  of  Algorithm  l  and 
we  then  give  the  corresponding  number  of  steps  required  for  one 
iteration. 

1.  PROCEDURE  PSDM/r,,*1) 

2.  k  •  1; 

3.  g1  -  GRADtENTfr,.!!1); 

4.  WHILE  ||#*  ||J  >»  DO 

5.  d*  -  DIRECTION/#*); 

6.  a*  •  <|*X*>/<#*.d*>; 

7.  FORALL i  «  (L2...,N)  DO  IN  PARALLEL#*'1  •  gt  -  a*d,*; 

8.  FORALL i«(L2...,N}  DO  IN  PARALLEL  u,*'1  -  u,‘  -  a*#f; 

9.  t  *  *  ♦  1; 

10.  END  WHILE 

1L  END  PROCEDURE 


Theorem  Z •  Given  it,  u1,  N,  ",  m  and  p  such  that  p-lorp/nisan 
integer,  the  number  of  stepa  required  by  oae  iteration  of  procedure 
PSDM  using p  processors.  1  <p  <nMvt,  is 


He. 

(6n  ♦2m-2)  ♦ 

Mw 

p 

P 

Nn 

(2n  +2m-2)  ♦ 

iVm 

P 

? 

:0"*2»*T).-=-  *  2l»f* 


IM<P$" 


(>  *S»  *7)  ♦  rnlKp-")  *  21af if  sin  <  p  <*VV1 


As  a  consequence  of  Theorem  2.  it  may  be  shown  that  the 
speedup  S,  and  efficiency  E,  for  procedure  PSDM  are  bounded  from 
below  by 

^-■J->0dp.  Ef  •  -jr-  >  0 A 
Tr  Pr* 

In  Figs.  L  2,  3  and  4,  we  plot  E,  for  the  procedure  PSDM  for  the 
values  a  ■  8,  16,  32  and  64  respectively,  and  in  each  case  use  the 
values  m  •  L  4  and  8.  It  is  then  easy  to  see  that  E,  increases  with  m, 
and  that  the  effideney  of  the  procedure  PSDM  is  bounded  from  below 
by  the  effiaeaey  of  the  procedures  FORWARD  and  REVERSE 
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VI.  CONCLUSIONS 

la  this  paper  a  parallel  implementation  of  the  beat-step  steepest 
descent  method  has  been  presented  to  solve  the  LQR  optimal  control 
problem.  The  procedure  exhibits  the  desirable  property  that  its  struc¬ 
ture,  and  hence  parallelism,  is  determined  by  the  number  of  avaiisble 
processors.  Thus,  unlike  approaches  in  which  the  structure  of  the  pro¬ 
cedure  changes  with  problem  sis,  the  procedure  presented  in  this 
paper  maintains  the  same  computational  and  interprocessor  commun¬ 
ication  requirements,  independently  of  the  number  of  stages  in  the 
control  problem.  Furthermore,  the  procedure  hss  been  shown  to 
exhibit  an  efficiency  £,  always  greater  than  0.6. 

The  paper's  basic  approach  can  be  used  to  produce  parallel 
implementations  of  more  complex  gradient  based  methods.  For 
example,  the  procedure  PSDM  may  be  easily  modified  to  produce  the 
following  parallel  version  of  the  Fietcher-Reeves  conjugate  gradient 
method. 

1.  PROCEDURE  PCGM(z0.u‘) 

I  k  -  l; 

3.  »*■(*■  GRADIENT^,, u1); 

4.  WHILE  | |gJ||J  >  «DO 

5.  d*  -  DIRECTION!*1); 

6.  or*  «  <gr*,»*>/<d*,ir*>; 

7.  FORALLi  *{1, 2. DO  IN  PARALLEL gf'1  gt  ■  o*df; 

8.  FORALLi  t  {L2,..„.V}  DO  IN  PARALLEL  u,*'1  u,*  -  cs*vf ; 

9.  5*  -  <#*' 

10.  FORALLi  «(L2,.~,.V}  DO  IN  PARALLEL 

.**»  :-*•*.*»?; 

U.  *  -  *  ♦  1; 

12.  END-WHILE 

13.  END  PROCEDURE 

Finally,  we  oou  that  the  procedures  presented  in  this  paper 
may  be  used  to  solve  discrete  optimal  control  problems  which  involve 
nonquadratic  coat,  nonlinear  dynamics  and  constraints  on  the  states 
and/or  controls.  In  that  case,  one  needs  to  use  penalty  function  sad 
gradient  projection  methods  and  suitable  approximations  to  coat  func¬ 
tion  and  system  dynamics. 
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8.  APPENDIX  III 

PARALLEL  GRADIENT  PROJECTION  ALGORITHMS  TO  SOLVE  THE  DISCRETE  LQR  OPTIMAL 
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ABSTRACT 

In  this  paper  we  present  two  parallel  gradient  based  iterative  algorithms  to  solve  the  linear  quadratic 
regulator  (LQR)  optimal  control  problem  with  hard  control  bounds.  In  the  first  part  of  the  paper,  we 
introduce  the  algorithms  in  the  context  of  the  general  class  of  problems  to  which  they  are  applicable.  The 
first  algorithm  is  a  parametrized  gradient  projection  method  and  can  be  used  to  solve  any  convex  program¬ 
ming  problem.  The  second  algorithm  is  a  combination  of  the  first  algorithm  with  a  constrained  version  of 
the  Fletcher-Reeves  conjugate  gradient  method  and  can  be  used  to  solve  linear  inequality  constrained 
problems.  We  then  use  the  two  algorithms  to  solve  the  LQR  optimal  control  problem  with  hard  control 
bounds.  In  the  second  part  of  the  paper,  we  show  that  at  each  iteration  parallel  evaluation  of  the  step 
length  and  projected  gradient  of  the  quadratic  cost  function  can  be  efficiently  performed  as  a  function  of 
the  number  of  processors.  We  then  embed  our  parallel  step  length  and  gradient  projection  procedures  to 
produce  two  parallel  algorithms  which  are  suitable  for  real-time  online  implementation  on  a  SIMD 
machine. 


I.  INTRODUCTION 

Practical  iterative  methods  to  solve  optimal  control  problems  must  exhibit  fast  convergence  coupled 
with  low  computational  overhead  per  iteration  so  that  they  may  be  implemented  in  a  real-time  online 
environment.  Furthermore,  they  must  be  interior  descent  methods  since  infeasible  approximations  to  the 
solution  are  unacceptable  for  most  applications.  In  this  paper  we  present  two  gradient  based  iterative  inte¬ 
rior  methods  for  the  parallel  solution  of  the  discrete  LQR  optimal  control  problem  with  hard  control 
bounds.  The  algorithms  are  synthesized  so  that  each  iteration  may  be  efficiently  executed  on  a  parallel 
computer.  Unlike  previous  parallel  approaches  to  the  solution  of  optimal  control  problems  which  simply 
fold  the  computations  to  fit  the  number  of  processors  [MEY73],  [SCH81],  [TRA80],  our  approach  has 
been  devised  by  explicitly  considering  the  given  computational  environment.  Consequently,  one  of  the 
features  of  the  parallel  algorithms  presented  in  this  paper  is  that  their  structure,  and  hence  their  parallel¬ 
ism,  is  determined  by  the  number  of  available  processors,  resulting  in  algorithms  which  are  matched  to  the 
given  parallel  environment 

We  introduce  the  algorithms  in  the  context  of  the  general  class  of  problems  to  which  they  are  appli¬ 
cable.  The  first  algorithm  is  a  parametrized  gradient  projection  method  and  can  be  used  to  solve  any  con¬ 
vex  programming  problem.  We  parametrize  the  algorithm  for  two  purposes:  first  the  algorithm  contains 
a  so  called  (-procedure  for  determining  the  active  constraints  in  order  to  prevent  the  possibility  of  jam¬ 
ming  and  to  ensure  convergence;  unlike  approaches  which  generate  a  sequence  of  r's,  our  approach  uses  a 
constant  ( for  ail  iterations.  Secondly,  we  investigate  the  performance  of  the  algorithm  for  various  param¬ 
eter  choices  with  the  goal  of  obtaining  improved  convergence.  The  second  algorithm  is  a  combination  of 
the  first  algorithm  with  a  constrained  version  of  the  Fletcher-Reeves  conjugate  gradient  method  and  can 
be  used  to  solve  any  problem  with  linear  inequality  constraints.  Furthermore,  we  show  that  a  slight 
modification  of  our  second  algorithm  results  in  the  algorithm  exhibiting  finite  convergence  on  problems 
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with  quadratic  cost. 

We  then  use  c  algorithms  to  solve  the  LQR  optimal  cont.  problem  with  hard  control  bounds 
which  we  rewrite  as  a  bounded  variable  quadratic  programming  problem  with  special  structure.  There 
exist  many  interior  methods  applicable  to  this  class  of  problems,  including  feasible  directions,  gradient  pro¬ 
jection,  reduced  gradient,  variable  metric  and  numerous  quadratic  programming  methods.  Our  motivation 
for  devising  a  gradient  projection  method  is  due  to  the  simple  projection  computation  that  occurs  with 
respect  to  the  control  problem  of  interest  as  well  as  avoiding  large  matrix  computations  involving  the  exact 
or  approximated  Hessian. 

We  give  two  parallel  implementations  of  our  algorithms  to  solve  the  LQR  optimal  control  problem 
with  hard  control  bounds.  In  both  implementations  the  step  length  and  gradient  is  obtained  by  using  the 
parallel  procedures  presented  in  [MEY87bj.  The  procedures  share  common  terms  and  exhibit  varying 
degrees  of  parallelism  as  a  function  of  the  number  of  processors.  We  constrain  the  number  of  available 
processors,  p,  to  lie  in  the  range  1  <  p  <  nl'P* ,  where  n  is  the  size  of  the  system  state  vector,  N  is  the 
number  of  stages  in  the  control  process  and  we  assume  n  >nt,  where  m  is  the  size  of  the  control. 

By  employing  a  blocked  formulation  of  the  approach  for  first-order  linear  recurrences  given  in 
[MEY87a],  we  embed  in  our  algorithms  the  approach  for  gradient  computation  given  in  [MEY87b]  which 
reduces  to  a  direct  parallel  implementation  of  the  technique  given  in  [POL71,  pp. 66-69]  when  l<p  < n 
and  achieves  speedup  greater  than  1/2 (p  +  n )  when  n  <  p  <  niVv* .  In  addition,  one  of  the  features  of  our 
proposed  parallel  iterative  algorithms  is  that  their  structure  is  completely  specified  by  the  number  of  pro¬ 
cessors  whenever  the  number  of  stages  N  >  (p  /n)1. 

The  organization  of  this  paper  is  as  follows  :  Section  II  introduces  the  convex  problem  and  associ¬ 
ated  optimality  conditions.  Section  IH  presents  the  parametrized  gradient  projection  algorithm  and 
corresponding  implementation  for  the  bounded  variable  quadratic  programming  problem.  In  Section  IV 
we  combine  the  algorithm  given  in  Section  III  with  a  linearly  constrained  version  of  the  Fletcher-Reeves 
conjugate  gradient  method.  We  then  rewrite  the  LQR  problem  with  hard  control  bounds  as  a  bounded 
variable  quadratic  programming  problem  in  Section  V.  Section  VI  presents  the  parallel  algorithms  to 
solve  the  LQR  problem  with  hard  control  bounds  and  the  corresponding  performance  analysis.  Finally,  in 
Section  VII  conclusions  are  presented. 

II.  THE  CONVEX  PROBLEM  AND  PRELIMINARIES 

We  consider  the  following  problem: 

Problem  I:  Given  m  +1  maps/°(.),  /l(.), ....  /"*(.) :  £*  -*£l  and  a  subset  0  of  £*  defined  by 

n»  {x  |  A r)<0  fori  -  1,  2, ...,  m}, 
find  a  point  or*  in  0  such  that  for  every  x  in  fl 

/V)  </>(*)• 

Definition  1:  Let  C  be  the  class  of  all  problems  of  the  form  of  Problem  I  in  which  the  maps 
A  ).  /*(•).  /"*(•)  are  such  that 

(0  f° (•)  i*  continuously  differentiable,  convex  and  radially  unbounded;  that  is,  /°(.)  is  such  that  given  any 
x  €  0,  to  every  scalar  a  corresponds  a  scalar  p  >  0  such  that  f°(x)  >  a  whenever  \\x  ||  >  p, 

(ii)  /*(.),  i  »  L  2, ...,  m,  are  continuously  differentiable,  convex  and  define  a  nonempty  constraint  set  Q 
and 

(iii)  the  set  (I  satisfies  the  Kuhn-Tucker  constraint  qualification  at  every  solutionx* . 

It  is  known  that  the  following  necessary  conditions  of  optimality  are  also  sufficient  for  any  problem 

in  C. 

Optimality  Conditions:  A  point  x*  e  0  is  an  optimal  solution  to  a  problem  in  C  if  and  only  if  there  exists  a 
vector  p€E%  with  components  ft  >  0,  i  -  1,  2, ...,  m,  such  that 
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V/V)‘  *  £  *W)‘  =  0 
1-1 

HiT{x‘)  =  0,  fori  *  1,  2, ....  m. 

III.  THE  PARAMETRIZED  GRADIENT  PROJECTION  ALGORITHM 

Given  a  point  x  €  0  and  the  parameter  e  >  0,  define  the  e-active  constraints  index  set  7  (x,  e)  as 

r(x,e)  =  {/  |  f(x)  >  -c} , 

given  a  subset  7  of  the  set  { 1,  2, ....  m },  define  the  subset  f) (7)  of  £*  as 

CUD  *  /  Cx  I  /*'(y)  -  0  f°r  ahi  e/}  if  /  f  b 
* ’  if  7  =  ^ , 

and  given  a  point  x  6  0  and  an  index  set  7,  let  w  (x,7)  be  the  projection  of  x  -  V/°(x)‘  onto  the  set  fl(7),  that 
is.  iv  (x,/)  satisfies 

||Cr-V/0(x)‘)-H-(x./)||  =  min{||(x- V/>(x )*)->- 1|  |yefl(/)}. 

> 

We  only  consider  problems  in  C,  and  thus  0(7)  is  non-empty  and  convex,  and  the  quantity  >v  (x,I)  is 
well  defined.  Using  the  notation  just  defined,  we  may  rewrite  the  optimality  conditions  as: 

Lemma  1:  If  a  point  x"  eO  is  optimal  for  a  problem  in  C  then  w(x‘j(x',e))  =  x*  for  all  e>0.  Con¬ 
versely,  ifx*  and  *  satisfy  w  (x*,/(x’,e))  =x',thenx*  is  optimal. 

We  now  gl/e  the  parametrized  gradient  projection  algorithm  to  solve  a  Problem  in  the  class  C. 
Algorithm  2:  Let  x 1  e  0  and  *  >  0  be  given. 

Step  0:  Set  J fc  =*  1. 

Compute  V/1(x1),/(x1  ,e)  =»  {i  |  f(xl)  >  -e]  and/*1  =  > v(xl,/(x\e)) -x1  . 

Step  1:  If  p*  ■  0  stop;  eise  go  to  SicH  2. 

Step  2:  Set  d*  = /?*. 

Step  3:  Compute  a*  >  0  to  minimize  /°(x*  +  a*d*)  subject  to  (x*  +  a*d*)  e  IL 
Step  4:  Let  x**1  =  x*  +  a*d\ 

StepS:  Compute  V/°(x**‘). 

Step  6:  Compute  7(r**‘,«)  *  {i  | /’(x**1)  > -«}. 

Step  7:  Compute/***1  =  w(x**l,/(x**l,«))  -x*+I . 

Step  8:  Set  *  =■>  A:  +  1  and  go  to  Step  1. 

Remark  1:  Although  not  explicitly  required  in  the  presentation  of  Algorithm  2,  we  introduce  the  quantity 
pk  in  order  to  facilitate  a  later  modification  which  results  in  our  second  algorithm. 

At  each  iteration.  Algorithm  2  generates  a  search  direction  d*  which  is  computed  by  projecting 
the  negative  gradient  onto  the  set  0(7  (x*,*)).  The  parameter  r  defines  the  "sufFidently-active"  constraint 
region.  If  a  point  x*  is  in  that  region,  then  7(r*,«)  ti  and  the  corresponding  search  direction  d*  is 
obtained  by  projecting  x*  -  Vf°(xkY  onto  the  set  [y  |  /*(y)  <  0  for  all  i€7(x‘,e)};  otherwise,  d*  = 
-V/°(r*)*.  For  c  large  enough,  0(7(x*,«))  ■  fl  for  all  k.  Thus,  by  choosing  t  large,  Algorithm  2  reduces  to 
projecting  x*  -  onto  the  entire  set  0  at  each  iteration.  For  c  -  0,  0(/(x‘,e))  *  £“  whenever 

x*  €  interior(n).  Thus,  when  «  ■  0,  Algorithm  2  reduces  to  a  projected  version  of  best-step  steepest  des¬ 
cent  in  which  the  direction  chosen  at  an  interior  point  is  the  negative  gradient. 

We  now  show  that  using  any  <  >  0  in  Algorithm  2  will  always  produce  a  solution  to  a  problem  in 
C,  even  though  the  direction  finding  map  used  to  compute  d*  is  not  dosed. 
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Lemma  2:  If  x  G  ft  is '  't  a  solution  to  a  given  problem  in  C  and  e  '  then  there  exists  an  e(x)  >  0  and 
S(x)  >  0  such  that 

f°(y  +  ad)<f°(y)  -6(x)  for  ally  Gf?  (x,e(x))  Oft. 

Theorem  1:  If  {x*}  is  a  sequence  constructed  by  Algorithm  2  to  solve  a  problem  in  C  then  either  {x*}  is 
finite  and  the  last  point  is  optimal  or  {x*}  is  infinite  and  contains  cluster  points,  each  of  which  is  optimal. 
Furthermore,  when  a  problem  in  C  is  such  that  /°(.)  is  strictly  convex,  then  that  problem  has  a  unique 
optimal  soludonx*,  and  in  that  case  {x*}  converges  tox*. 

In  order  to  present  the  parallel  implementation  of  Algorithm  2  used  to.  solve  the  LQR  problem, 
we  first  use  it  to  solve  a  particular  case  of  Problem  1,  specifically  the  bounded  variable  quadratic  program¬ 
ming  problem  given  below. 

Problem  2:  Fmdx  G  £*  with  components x;,  /  =  1,  2, ....  n,  that  minimizes  the  performance  index  /°(x)  = 
1/2  x‘Hx  +  h*x  +  c  subject  to  x  G  ft,  where  H  is  an  n  x  n  symmetric  positive  definite  matrix,  b  is  an  n  x  1 
vector  and  ft  is  the  unit  hypercube  defined  as  ft  =  (x  |  |  x;  |  <  1,  for  i  =  1,  2, ...,  n  }. 

Since  H  is  positive  definite,  the  cost  function  f°(.)  is  strictly  convex.  The  nonempty  constraint  set 
ft  is  the  intersection  of  the  2 n  linear  inequalities  /*(x)  ■  xj  - 1  <  0  and  f  *%(x )  =  -x,  - 1  <  0  for  i  =  1,  2, ..., 
n.  Therefore,  Problem  2  lies  in  C  and  by  Theorem  1,  Algorithm  2  will  generate  a  sequence  which  con¬ 
verges  to  the  unique  solution x*. 

We  now  give  some  properties  of  implementing  Algorithm  2  to  solve  Problem  2. 

Lemma  3:  Given  x*  and  V/°(x*),  let  pk  =  w  (x*/(x*,«))  -x*  for  the  case  of  Problem  2.  Then  each  com¬ 
ponent  pk  of  pk  can  be  computed  as 

if  i  G  /  (x*,s)  and  x?  -  V/?(x*)*  >  1 

ifi+/tG/CrV)andx^-V/?(x*)‘<-l  (3.1) 

otherwise, 

where  we  denote  V/J(x*)  to  be  the  i-th  component  of  V/°(r‘). 

Remark  2:  If  *>2  then  7(x*»  =  (1,  2,  ...,  7n}  for  all  k.  Consequently,  w(x*,/(x*,e))  =  SA7(x*  - 
V/)(x*)‘)  and  Algorithm  2  reduces  to  projecting  onto  the  entire  set  ft  at  each  iteration. 

Lemma  4:  Given  x*  and  V/°(x*),  let  a*  minimize  f°(xk  +  a*d*)  subject  to  (x*  +  a*dk)  G  ft  for  the  case  of 
Problem  2.  Then  a*  can  be  computed  as  a*  =*  minja^a, },  where 

a.  -  -  <d*.  V/°(x*)‘  >  /  <d\  Hdk  > ,  (3.2) 

SBtt  •  X* 

0^  -  min  {a  |  a  »  - - - —  for  all  i  such  that  dk  ^0}.  (3.3) 

a  d* 

Remark  3:  Since  the  cost  function  is  quadratic  and  we  are  using  the  update  x*  + 1  =  x*  +  o*d*,  the  quan¬ 
tity  V/°(x**1)*  can  be  updated  using  V/°(x**I)‘  =  V/° (x*)‘  +  a kHdk.  Consequently,  we  need  only  com¬ 
pute  the  gradient  directly  in  Step  0  of  Algorithm  2,  and  then  use  the  gradient  update  at  each  subsequent 
iteration. 

We  now  give  the  following  implementation  of  Algorithm  2  to  solve  Problem  2. 

Algorithm  3:  Let  x1  and  t  >  0  be  given. 

Step  0:  Set  k  ■  1. 

Compute  '5y°(r1)*  *  Hx1  +h,/(x1,e)  ■  {j  | x}-l  >  •*}  U  {i  +n  |  -x,*-l>-«}  andp1  using  Eq.  (3.1). 
Step  1:  If  lip*  ||2  <  *,  then  stop;  else  go  to  Step  2. 
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Step  2:  Set  dk  =  ph. 

Step  3:  Compute  at  min  {a^.a*},  according  to  Eqs.  (3.2)  and  (3.-,. 

Step  4:  Letx**1  =  xk  +  cfdk. 

Step  5:  Let  y/°(r*+l)‘  -  V/°(x‘)‘  +  <fHdk. 

Step  6:  Compute  /  (x*  *l,«)  »  {i  |  xf  *l  - 1  >  -e}  U  {/  +n  |  -x<*l-l>-e}. 

Step  7:  Compute  pk  * 1  using  Eq.  (3.1). 

Step  8:  Set  k  =  k  +  1  and  go  to  Step  1. 

In  Section  IV  we  modify  Algorithm  3  to  obtain  improved  convergence  properties  when  a  problem 
in  C  is  linearly  constrained. 

IV.  FLETCHER-REEVES  MODIFICATION  FOR  LINEARLY  CONSTRAINED  PROBLEMS 


We  now  consider  the  following  linearly  constrained  class  of  problems. 

Definition  2:  Let  L  be  the  class  of  all  problems  of  the  form  of  Problem  1  in  which  the  maps 
/°(  ).  /l(-), /"*(■)  are  such  that 

(i)  /°(.)  is  continuously  differentiable,  convex  and  radially  unbounded, 

(ii)  f(x)  =  <g*,  x  >  -  hi,  for  i  =  1, 2, ...,  m  define  a  nonempty  constraint  set  0  and 

(iii)  for  any  x  6  0,  g‘  for  i  €  /  (x,  0)  are  linearly  independent. 

Givenx1  6  fl,  let  7(x*,0)  be  the  index  set  defined  by  7(r*,0)  =  {»'  |  <g*,  x  >  -  h,  =  0}  and  let  Gk 
be  the  corresponding  matrix  whose  rows  are  (g')‘  for  all  i  6/(x*,0).  The  projection  matrix  Pk  associated 
with  the  point  x*  is  Pk  =  7  -  Let  A7(7(x*,0))  be  the  linear  manifold  defmed  by 


Af(/(x\0)) 


{x  !  <g*,x>  -hi  =  0  for  all/  €/(r*,0)}  if  7(x*,0)  f 
E*  if  7(x',0)  =«  <p. 


Then  we  may  compute  the  projection  of  V/°(x*)  onto  the  set  A/(7(x*,0))  asp  =  *P*V/°(xi)‘. 

Our  approach  to  solving  a  problem  in  L  is  to  approximate  /°(.)  as  a  quadratic  and  generate  a  new 
direction  dk*1  which  is  conjugate  with  respect  to  the  previous  direction  dk  whenever  possible.  We  com¬ 
pute  the  initial  direction  d1  =  pl  and  subsequent  directions  dk  using  dk  =  pk  +  /9*d*'1,  where  f  - 
\\pk  ||2/ lip*'1 1|2.  However,  rather  than  projecting x*  -  V/°(x*)‘  onto  the  active  contstraint  set  M(7(x\  0)), 
we  project  onto  the  approximation  set  ft(/(r *,«)),  that  is,  we  compute  pk  =  w  (x*,7(x*,e))  -  x*.  Conse¬ 
quently,  we  may  produce  ad**1  which  is  conjugate  to dk  whenever Pk+i  -  Pk  and pk * 1  »  pk * 1 ;  otherwise 
we  restart  the  algorithm  with  d**1  ■  pk*1-  Furthermore,  if /°(.)  is  not  quadratic,  the  algorithm  should  be 
restarted  at  least  every  n  -  |  7(x**l,0)  |  steps  as  a  spacer  step  to  ensure  global  convergence. 

We  now  give  the  parametrized  gradient  projection  algorithm  with  the  Fletcher-Reeves 
modification  to  solve  a  problem  in  the  class  L. 

Algorithm  4:  Let  x1  and  t  >  0  be  given. 

Step  0:  Set  k  *  k  *  1  and  d°  =  0. 

Compute  V/V),  /(xl,«)  =  {/  \f{xx)>-€),  /(x‘,0)  =  {i  \f{xl)  =  0} 

pl  »  w(rl/(r1,i»-ri. 

Step  1:  Ifp*  -  0  then  stop;  else  go  to  Step  2. 

Step  2:  If  dk  i  *  0  then  let  d*  -  pk  and  go  to  Step  5;  else  go  to  Step  3. 

Step  4:  Letd*  *  pk  +  £*d*‘l. 


and 


Step  5:  Compute  a*  >  0  to  minimize /°(x*  +  a*d*)  subject  to  (xk  *■  €  Q. 

Step  6:  Let  at**1  =x  a*d*. 

Step  7:  Compute  V/°(x*M). 

Step8:  Compute =  {/  |  f(xk*1)  >  -e}. 

Step 9:  Computep**1  =  w(xk*l,I  (xk¥l,e)) -x**1. 

Step  10:  Compute /£r*  +  1,0)  =*  {/  |  f'(xh*1)  =  0}. 

Siw^i  11:  Compute p**1  =  -/>*♦lV/<,(r*  +  1),. 

Step  12:  lf/(x*  +  1,0)  )*/ (x*,0)  orp**1  ^p*+l  ork  >  n  •  |/(x*  +  1,0)|  then  set 
dk  =  k  =  0. 

Step  13:  Set  k  =  k  +  1,  k  =  fc  +  1  and  go  to  Step  1. 

When  /°(.)  is  a  quadratic.  Algorithm  4  generates  a  new  direction  dk  +  l  which  is  conjugate  to  dk 
whenever  the  point  xk*1  is  in  the  same  active  constraint  set  M (/(x*,0)  as  x*  and  p*  *l  =  p  '  .  However, 
since p*  +  l  is  obtained  by  projecting  onto  the  set  0(I(xk*l,t))  rather  than  A/(/(x\0)),  Algorithm  4  is  not  a 
pure  active  set  method.  Thus,  Algorithm  4  may  change  constraint  sets  before  the  minimizer  on  the  current 
constraint  set  is  found  and  consequently  does  not  exhibit  finite  convergence  in  the  case  of  quadratic  cost. 

Theorem  3:  If  {xk}  is  a  sequence  constructed  by  Algorithm  4  to  solve  a  problem  in  L,  then  either  {x*}  is 
finite  and  the  last  point  is  optimal  or  {x*}  is  infinite  and  contains  cluster  points,  each  of  which  is  optimal. 
Furthermore,  when  a  problem  in  L  is  such  that  /°(.)  is  strictly  convex,  then  that  problem  has  a  unique 
optimal  solution x*,  and  in  that  case  {x*}converges  tox*. 

We  now  give  a  slight  modification  to  Algorithm  4  which  allows  the  resulting  algorithm  to  achieve 
finite  convegence  when  used  to  solve  problems  with  quadratic  cost. 

Algorithm  5:  Let  x1  and  t  >  0  be  given. 

Step  1:  Use  a  conjugate  gradient  method  to  find  the  minimizer  x*  for  the  unconstrained  version  of  the 
problem. 

Step  2:  Ifx*  €  fl  then  stop;  else  go  to  Step  3. 

Step  3:  Compute  x J  to  be  the  projection  ofx*  onto  the  set  ft. 

Step  4:  Use  Algorithm  4  withxj  as  the  starting  point  to  compute  the  solution  of  the  problem. 

Corollary  1:  If  {x*}  is  a  sequence  constructed  by  Algorithm  5  to  solve  a  problem  in  L  with  /°(.)  a  qua¬ 
dratic  in  x  with  postitive  definite  Hessian,  then  {x*}  is  finite  and  the  last  point  is  optimal. 

V.  DISCRETE  LQR  CONTROL  PROBLEM  WITH  HARD  CONTROL  BOUNDS 

In  this  section  we  rewrite  the  discrete  LQR  optimal  control  problem  with  hard  control  bounds  as 
a  bounded  variable  quadratic  programming  problem  and  then  show  how  the  gradient  and  step  length  com¬ 
putations  may  be  organized  to  share  common  terms. 

Problem  3:  Given  an  m-input,  discrete,  time-varying  linear  system  in  which  we  are  given  the  initial  state, 
za  €  £",  and 

Zi  =  Aa.i  +  BiUi,  i  =»  1,  2, ....  N,  (5.1) 

where  for  i  -  0,  L,  ...,  N,  z,  e  £*  is  the  state  of  the  system  at  time  i  and  for  i  -  L,  2,  ...,  N,  u,  e  Em  is  the 
control  at  time  i  with  components  u^,  j  -  L,  2,  ...,  m,  find  the  mN  control  vector  u  =  («*,  u*2, . . . ,  u*)‘ 
that  minimizes  the  performance  index 

/(«)  3  y  E  [ziQA  +  «*£.«.), 

and  satisfies 


u  eft, 


-  J>  b  - 

where  for  i  =  1,  2,  .  V  Q,  are  n  xn  symmetric  positive  semi-defi' '  ’  matrices,  R,  are  m  x  m  symmetric 
positive  definite  mau.-cs  and  the  convex  compact  set  fi  is  defined  as 

0  =  {«  |  |  Uij  |  <  1,  for  i  =  L,  2, N,  j  =  1,  2, m }. 

Using  the  notation  introduced  in  [MEY86],  Problem  3  may  be  rewritten  as  a  bounded  variable 
quadratic  programming  problem  in  u: 

Problem  4:  Find  the  control  vector  u  £  EmM  that  minimizes  the  performance  index  J  (u)  =  1/2  u‘Hu  +  b‘u 
+  c  subject  to  u  g  H,  where  H  is  the  mN  x  mN  block  symmetric  positive  definite  matrix  with  block  size 
m  x  m  given  by//  =  (it  +  F\F*QF'tF%),  b  is  the  mN  x  1  vector  given  by  b*  =  z^F^^QF'^F,,  and  c  is  the 
scalar  given  by  c  =  \/2z‘!iF<aF'}QF,lFazo,  and  the  matrices  R,  Q,  F„  F.  and  F 0  are  defined  as:  Q  = 
Diag(Q i,  Q j,...,  Qtt),  is  the  nN  x  nN  block  matrix  that  consists  of  Nn  xn  symmetric  positive  semi-definite 
blocks  Qi,  R  =  Dij?(Rlt  F2,...,  Rn)  is  the  mN  xmN  block  matrix  that  consists  of  N  m  xm  symmetric 
positive  definite  blocks  F,  is  the  nN  x  nN  block  lower  bidiagonal  matrix  that  consists  olN1  n  xn  blocks 
[F^  and  F«  is  the  nN  x mN  block  diagonal  matrix  that  consists  of  N1  n  xm  blocks  ^F,j  defined  for  all  i 

and  j  in  [1,2,..  J*f]  by 


/ 

A, 

0 


ij*  => 

if  i  =  i  + 1, 

otherwise 


a- 


Bi 

0 


ifi  =; 
otherwise 


and  F0  is  the  nN  x  n  block  matrix  that  consists  of  N  n  xn  blocks  ^F^  defined  for  all  i  in  [1,2,...,//]  by 


H  * 


A  i  if  i  =  1 

0  otherwise . 


Since  Problem  4  is  of  the  form  of  Problem  2,  Algorithms  3  and  4  can  be  used  to  solve  Problem  4. 
We  now  use  the  approach  for  the  evaluation  ,'fg(ul),  or*  andgfu**1)  given  in  [MEY87b]  in  which  a*  and 

g  («  *  * 1 )  are  computed  by  sharing  common  terms,  where  we  use  the  notation  g(uk)  *  {dJ  /  du)*  . 

We  first  consider  the  computation  of  g  (u 1 ),  which  is  performed  only  once  in  Step  0  of  Algorithms 
3  and  4.  The  matrix  F,  is  non-singular  and  thus  we  may  write  the  gradient  as  g  (u 1 )  =  Ru 1  +  F*%F*Qz 1 , 
wherez1  satisfies  Eq.  (5.1).  Let  A 6 £“*  be  the  costate  vector  A  *  (A*,  A‘2( . . .  ,Ay)‘,  A,  €£*,  defined  by  A 
=  F*Qz.  Then,  given  u 1  and  Zq,  the  gradient  g(u l)  may  be  obtained  by  using  the  three  equations 

Fjz1  »  F.u1  +  FoZ0,  (5.2) 

F*. A*  -  Qz\  (5.3) 

g(ul)  »  Ru1  +  F^A1.  (5.4) 


As  a  consequence  of  Eqs.  (5.2),  (53)  and  (5.4)  we  obtain  the  gradient  evaluation  technique  pro¬ 
posed  by  Polak  (POL71J: 

Algorithm  6:  Given  u 1  and  z0. 

Step  1:  Compute  z1  such  that  F^1  *  F.u1  +  Fqz0. 

Step  i  Compute  A1  such  that  F^A1  =  £zl. 

Step  3:  Compute  g(ul)  *  Ru1  +  F*»Al. 

Due  to  the  lower  n  x  n  block  bidiagonal  structure  of  F„  Steps  1  and  2  of  Algorithm  6  require  pro¬ 
cedures  for  the  solution  of  N  stage  forward  and  reverse  n  xn  block  first-order  linear  recurrences,  respec¬ 
tively.  Such  procedures  are  discussed  in  Section  VI.  Given  A1,  Step  3  computes  g(u l)  by  computing  each 
of  the  N  uncoupled  components  g}  «  (dJ /dui)\.% •;  thus.  Step  3  exhibits  linear  speedup  when  executed  in 
parallel. 


Algorithms  3  and  4  both  use  the  common  term  **  =  Hdk  in  ,lie  computation  of  the  step  length  -/ 
and  the  next  gradien.  u**1),  We  now  show  that  x*  can  be  effiden. .  computed  by  writing  x*  as 

x*  =  Mk  +  F\FtQF,lF%dk, 

and  using  Algorithm  7  below,  where  we  note  that  instead  of  requiring  F',1  and  F*,  we  solve  linear  systems 
corresponding  to  F,  and  F*,. 

Algorithm  7:  Given  dk. 

Step  1:  Compute  p*  *  Rdk. 

Step  Z  Compute  <5*  =  F%dk. 

Step  3:  Compute  w*  such  that  F,wk  =  Sk. 

Step  4:  Compute  tj*  =  Qwk. 

Step  5:  Compute  vk  such  that  F*,vk  =  tf. 

Step  6:  Compute  x*  =  F\vk. 

Step  7:  Compute  x*  =  +  x*- 

With  the  exception  of  Steps  3  and  5,  Algorithm  7  computes  x*  by  executing  a  series  of  matrix- 
vector  products  followed  by  a  vector  sum.  Each  of  the  matrix-vector  products  consists  of  N  uncoupled 
block  matrix-vector  products,  thus  exhibiting  linear  speedup  when  implemented  in  parallel.  However,  due 
to  the  structure  of  F„  Steps  3  and  5  require  the  solution  of  N  stage  forward  and  reverse  n  xn  block  first- 
order  linear  recurrences,  respectively.  As  in  the  case  of  Algorithm  6,  this  again  suggests  the  need  for 
parallel  procedures  to  solve  linear  recurrences. 

VI.  PARALLEL  ALGORITHMS  FOR  OPTIMAL  CONTROL  PROBLEMS 

The  model  of  SIMD  parallel  computation  that  we  use  consists  of  a  global  parallel  memory,  p 
parallel  processors,  and  a  control  unit,  where  all  processors  perform  the  same  operation  at  each  time  step. 
We  further  simplify  the  model  by  making  the  following  assumptions: 

Al.  Each  computational  operation  requires  the  same  amount  of  time,  referred  to  as  a  step. 

A2.  There  are  no  accessing  conflicts  in  global  memory. 

A3.  All  initial  data  resides  in  global  memory. 

A4.  There  is  no  time  required  to  access  global  memory. 

We  use  the  parallel  procedures  FORWARD  and  REVERSE  given  in  [MEY87b]  to  solve  the  for¬ 
ward  and  reverse  V  stage  n  xn  block  first-order  linear  recurrence  systems  that  are  required  by  Algorithms 
6  and  7.  The  procedures  are  blocked  versions  of  the  parallel  scalar  approach  given  in  [MEY87a]  and  are 
formulated  as  a  function  of  the  number  p  of  processors  so  that  the  algorithm  structure  is  fixed  whenever 
the  number  of  stages  N>(p  /nf.  We  then  use  the  parallel  procedures  GRADIENT  and  DIRECTION 
also  given  in  [MEY87b]  to  obtain  parallel  implementations  of  Algorithms  6  and  7. 

We  next  give  the  following  parallel  implementation  of  computing  the  quanities  pk  - 
w(u*,/(u*,#)) -u*  andp*  m -Pkg(uk). 

I.  PROCEDURE  PROJECTIONl(u\  g(uk)) 

Z  FORALL i  €{l,2,..nAf},;  €{l,2,..^n}  DO  IN  PARALLEL^  :-  -&,<«*); 

3.  FORALL  i  €  {1,2 ,-,N},  j  6  {l,2,..yw }  DO  IN  PARALLEL 

4.  IF  Uij  - 1  >  ■€  AND  ujjy -&/(«*)  >  1  THEN />£,  :-  + 1  -  u*y; 

5.  EF -Uij  - 1  > •«  AND «<j  - <  -lTHENp^:-  -1-u^; 

6.  END  FORALL 

7.  RETURN  p*; 

%.  END  PROCEDURE 


1.  PROCEDURE  PROJECTlON2(u*,  g(uk)) 

2.  FORALL  i  €  {  j  6  {1,2,... ,m  }  DO  IN  PARALLEL /  .=  &,/»*); 

3.  /(a*,0):-  * 

4.  FORALL  /  €  {1,2,...,N},  j  <E  {1,2, ...,m }  DO  IN  PARALLEL 

5.  IFutj  -l  *  0  THENp*  0,/(u\0)  :=■  / (u*,0)  u  (/,/); 

6.  IF  - 1  »  OTHENp*  :=  0,/(u*,0):  =  /(u*,0)  u  (/,;); 

7.  END  FORALL 

8.  RETURN  p*,  7(«*,0); 

9.  END  PROCEDURE 

We  now  embed  the  parallel  procedures  GRADIENT,  DIRECTION,  and  PROJECTION  1  to 
obtain  a  parallel  implementation  of  Algorithm  3  to  solve  Problem  4  and  we  then  give  the  corresponding 
number  of  steps  required  for  one  iteration  using p  processors. 

1.  PROCEDURE  PGPMfzo.u1) 

2.  1; 

3.  g(u‘):  =  GRADIENTfro.w1); 

4.  p1  PROJECTION^1,  gCu1)); 

5.  WHILE  |b*  II2  >  *.  DO 

6.  d“:-pk; 

7.  **  :=  DIRECTIONS*); 

8.  a*  min  {a,,  c^}; 

9.  FORALL  /  6  { 1, 2, ...,//}  DO  IN  PARALLEL  u}* 1 :  =  uf  +  ; 

10.  FORALL  /  £  {1,2,...,N}  DO  IN  PARALLEL  gf*1  :=  g?  + 

11.  p*  +  1  :  =  PROJECnONl(u*tl,  g(u*  +  1)); 

12.  k  :■  Jr  +■  1; 

14.  END  WHILE 

15.  END  PROCEDURE 

Theorem  4:  Givens,  n l,  jV,  n,  nj  and/*  such  thatp  -  lorp/n  is  an  integer,  the  number  of  steps  required 
by  one  iteration  of  procedure  PGPM  using p  processors,  1  <p<  nN 1/1 ,  is 


T(Nji,mp) 


■|(6n  +  2m-2)  + 


yN(2n  +2m  +  22) --y-  +  4 iog# 


if  1  <p  <  n 


—  (2n+2m-2)  +  —  (2n +2m +22)  +  -N-'\ 
P  P  ( p/ny-\ 


8<p  vi)  +  4 log#  if  n  <  p  <  nN x/*. 


We  next  embed  the  parallel  procedures  GRADIENT,  DIRECTION,  PROJECTION  1  and  PRO¬ 
JECTION  to  obtain  the  following  parallel  implementation  of  Algorithm  4  to  solve  Problem  4  and  we  then 
give  the  corresponding  number  of  steps  required  for  one  iteration. 

1.  PROCEDURE  PGPFRM(20,u l) 

2.  k  :=*  1; 

3.  g(ul)  :*  GRADIENTbo.H1); 

4.  p1  :>  PROJECTIONlb1,  g(u 1)); 

5.  WHILE  |b*ll2  >  «,  DO 

6.  IF  dk  l  *  0  THEN  d*  :  *  p* 

7.  ELSE0*=  lb*  II2/  lb*'1  II2 

FORALL/  €  {L2,...,N}  DO  IN  PARALLEL  d?  p?  +  fdtl; 

8.  END  IF 

9.  «*:- DIRECnON(d*); 

10.  a*  min  {a,,  a,}; 

11.  FORALL/ €{1, 2,. ..,N}  DO  IN  PARALLEL uf  * 1  :=  u*  +  a*df; 

12.  FORALL/€{l,2,...,iV}  DO  IN  PARALLEL gf*1  :=  gf  +  a***; 

13.  p**1  :«  PROJECnONl(u*“,g(u*)-*-l); 
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p'"  :=  PROJECTION2(m* *\  g(u**'l); 

rsni.k'im  <*  m  r»o  „*  * 1  ±  n*  * 1  T 


15.  IF/(u**l,0).  t\0)OR pktl  fp  THEN d*  :  =  0; 

16.  k  :  =  fc  1; 

17.  END  WHILE 

18.  END  PROCEDURE 

Theorem  5:  Given  z0,  u l,  N,  n,  m  and  p  such  that  p  *  1  or  p  /n  is  an  integer,  the  number  of  steps  required 
by  one  iteration  of  procedure  PGPFRM  using p  processors,  1  <p  <  n/Vv*,  is 


*  —  (2n  +2m  +34).  —  +  7Zoj*> 
P  P  P 


if  1  </>  </i 


—  (2n+2m-2)+  —  (2» +2m +34)  ♦  V'\  Kp-n)  +  Vog#  JnKpKnN^. 

p  P  (p/«r-t 


The  speedup  5,  and  efficiency  Ep  for  procedures  PGPM  and  PGPFRM  can  be  obtained  directly 
from  Theorems  4  and  5.  A  straightforward  computation  shows  that  Sf  and  EP  are  bounded  from  below  by 
Sp  >  O.tyt  and  £f  >  0.6. 

VII.  CONCLUSIONS 


In  this  paper  two  parallel  algorithms  have  been  presented  to  solve  the  discrete  LQR  optimal  con¬ 
trol  problem  with  hard  control  bounds.  The  algorithms  exhibit  fast  convergence  coupled  with  a  high 
degree  of  parallelism  at  each  iteration,  making  them  suitable  for  real-time  online  implemenation  on  an 
SIMD  machine.  Moreover,  the  algorithms  possess  the  desirable  property  that  their  structure,  and  hence 
parallelism,  is  determined  by  the  number  of  available  processors.  Thus,  unlike  approaches  in  which  the 
structure  of  the  procedure  changes  with  problem  size,  the  procedures  presented  in  this  paper  maintain  the 
same  computational  and  interprocessor  communication  requirements  independently  of  the  number  of 
stages  in  the  control  problem.  Although  not  considered  in  this  paper,  interprocessor  communication 
requirements  should  not  be  a  critical  performance  factor  in  view  of  the  implementation  results  for  parallel 
linear  recurrence  solvers  presented  in  [MEY87a]. 

Finally,  we  note  that  the  parallel  procedures  presented  in  this  paper  may  be  used  to  solve  con¬ 
strained  discrete  optimal  control  problems  which  involve  nonquadratic  cost  and  nonlinear  dynamics.  In 
that  case  one  uses  suitable  approximations  in  which  the  system  dynamics  are  linearized  and  the  cost  is 
approximated  quadratically. 
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